﻿WEBVTT

00:00:06.971 --> 00:00:10.106
Okay, so welcome to lecture two of CS231N.

00:00:10.106 --> 00:00:12.646
On Tuesday, just to recall,
we, sort of, gave you

00:00:12.646 --> 00:00:15.098
the big picture view of
what is computer vision,

00:00:15.098 --> 00:00:16.187
what is the history,

00:00:16.187 --> 00:00:18.101
and a little bit of the
overview of the class.

00:00:18.101 --> 00:00:20.340
And today, we're really going
to dive in, for the first time,

00:00:20.340 --> 00:00:21.451
into the details.

00:00:21.451 --> 00:00:23.339
And we'll start to see,
in much more depth,

00:00:23.339 --> 00:00:25.752
exactly how some of
these learning algorithms

00:00:25.752 --> 00:00:27.864
actually work in practice.

00:00:27.864 --> 00:00:29.089
So, the first lecture of the class

00:00:29.089 --> 00:00:31.927
is probably, sort of, the
broadest big-picture view.

00:00:31.927 --> 00:00:33.808
And the majority of the
lectures in this class

00:00:33.808 --> 00:00:35.574
will be much more detail oriented,

00:00:35.574 --> 00:00:37.298
much more focused on
the specific mechanics,

00:00:37.298 --> 00:00:39.617
of these different algorithms.

00:00:39.617 --> 00:00:41.148
So, today we'll see our
first learning algorithm

00:00:41.148 --> 00:00:43.419
and that'll be really exciting, I think.

00:00:43.419 --> 00:00:44.822
But, before we get to that,

00:00:44.822 --> 00:00:47.535
I wanted to talk about a couple
of administrative issues.

00:00:47.535 --> 00:00:48.785
One, is Piazza.

00:00:49.645 --> 00:00:51.580
So, I saw it when I checked yesterday,

00:00:51.580 --> 00:00:54.168
it seemed like we had maybe 500 students

00:00:54.168 --> 00:00:55.357
signed up on Piazza.

00:00:55.357 --> 00:00:56.839
Which means that there
are several hundred of you

00:00:56.839 --> 00:00:58.493
who are not yet there.

00:00:58.493 --> 00:01:01.457
So, we really want Piazza
to be the main source

00:01:01.457 --> 00:01:04.221
of communication between the
students and the course staff.

00:01:04.221 --> 00:01:07.401
So, we've gotten a lot of
questions to the staff list

00:01:07.401 --> 00:01:11.170
about project ideas or questions
about midterm attendance

00:01:11.170 --> 00:01:12.841
or poster session attendance.

00:01:12.841 --> 00:01:14.389
And, any, sort of, questions like that

00:01:14.389 --> 00:01:16.138
should really go to Piazza.

00:01:16.138 --> 00:01:18.134
You'll probably get answers
to your questions faster

00:01:18.134 --> 00:01:21.568
on Piazza, because all the
TAs know to check that.

00:01:21.568 --> 00:01:23.181
And it's, sort of, easy
for emails to get lost

00:01:23.181 --> 00:01:26.234
in the shuffle if you just
send to the course list.

00:01:26.234 --> 00:01:28.983
It's also come to my attention
that some SCPD students

00:01:28.983 --> 00:01:33.629
are having a bit of a hard
time signing up for Piazza.

00:01:33.629 --> 00:01:35.458
SCPD students are supposed to receive a

00:01:35.458 --> 00:01:37.791
@stanford.edu email address.

00:01:38.804 --> 00:01:40.733
So, once you get that email address,

00:01:40.733 --> 00:01:44.276
then you can use the Stanford
email to sign into Piazza.

00:01:44.276 --> 00:01:45.857
Probably that doesn't
affect those of you who are

00:01:45.857 --> 00:01:46.908
sitting in the room right now,

00:01:46.908 --> 00:01:50.408
but it does for those students listening on SCPD.

00:01:52.191 --> 00:01:55.643
The next administrative issue
is about assignment one.

00:01:55.643 --> 00:01:58.178
Assignment one will be up later today,

00:01:58.178 --> 00:01:59.517
probably sometime this afternoon,

00:01:59.517 --> 00:02:01.653
but I promise, before
I go to sleep tonight,

00:02:01.653 --> 00:02:03.043
it'll be up.

00:02:03.043 --> 00:02:04.589
But, if you're getting a little bit antsy

00:02:04.589 --> 00:02:07.304
and really want to start
working on it right now,

00:02:07.304 --> 00:02:09.297
then you can look at last year's version

00:02:09.297 --> 00:02:10.531
of assignment one.

00:02:10.531 --> 00:02:12.684
It'll be pretty much the same content.

00:02:12.684 --> 00:02:15.081
We're just reshuffling it
a little bit,

00:02:15.081 --> 00:02:17.169
like, for example, upgrading it
to work with Python 3,

00:02:17.169 --> 00:02:19.082
rather than Python 2.7.

00:02:19.082 --> 00:02:20.943
And making some minor cosmetic changes,

00:02:20.943 --> 00:02:23.068
but the content of the
assignment will still be the same

00:02:23.068 --> 00:02:24.742
as last year.

00:02:24.742 --> 00:02:27.332
So, in this assignment you'll
be implementing your own

00:02:27.332 --> 00:02:28.665
k-nearest neighbor classifier,

00:02:28.665 --> 00:02:30.686
which we're going to talk
about in this lecture.

00:02:30.686 --> 00:02:33.514
You'll also implement several
different linear classifiers,

00:02:33.514 --> 00:02:35.694
including the SVM and Softmax,

00:02:35.694 --> 00:02:37.927
as well as a simple
two-layer neural network.

00:02:37.927 --> 00:02:39.410
And we'll cover all this content

00:02:39.410 --> 00:02:42.160
over the next couple of lectures.

00:02:43.112 --> 00:02:46.168
So, all of our assignments
are using Python and NumPy.

00:02:46.168 --> 00:02:49.229
If you aren't familiar
with Python or NumPy,

00:02:49.229 --> 00:02:51.608
then we have written a
tutorial that you can find

00:02:51.608 --> 00:02:54.312
on the course website to
try and get you up to speed.

00:02:54.312 --> 00:02:56.856
But, this is, actually, pretty important.

00:02:56.856 --> 00:02:59.211
NumPy lets you write these
very efficient vectorized

00:02:59.211 --> 00:03:02.236
operations that let you do
quite a lot of computation

00:03:02.236 --> 00:03:03.856
in just a couple lines of code.

00:03:03.856 --> 00:03:06.081
So, something super important for pretty much

00:03:06.081 --> 00:03:09.087
all aspects of numerical
computing and machine learning

00:03:09.087 --> 00:03:10.288
and everything like that,

00:03:10.288 --> 00:03:13.424
is efficiently implementing
these vectorized operations.
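
As a quick illustration of what "vectorized operations" means here (this snippet is an editorial sketch, not from the lecture slides), compare an explicit Python loop with the equivalent single NumPy expression:

```python
import numpy as np

# Elementwise sum with an explicit Python loop (slow for large arrays).
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
loop_result = np.empty_like(a)
for i in range(len(a)):
    loop_result[i] = a[i] + b[i]

# The same computation as a single vectorized expression (fast, concise).
vectorized_result = a + b

print(np.array_equal(loop_result, vectorized_result))  # True
```

The vectorized form pushes the loop into NumPy's compiled internals, which is where the efficiency comes from.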

00:03:13.424 --> 00:03:15.600
And you'll get a lot of practice with this

00:03:15.600 --> 00:03:16.964
on the first assignment.

00:03:16.964 --> 00:03:19.433
So, for those of you who
don't have a lot of experience

00:03:19.433 --> 00:03:23.449
with MATLAB or NumPy or
other types of vectorized

00:03:23.449 --> 00:03:26.407
tensor computation, I recommend
that you start looking

00:03:26.407 --> 00:03:27.606
at this assignment pretty early

00:03:27.606 --> 00:03:32.175
and also, read carefully
through the tutorial.

00:03:32.175 --> 00:03:34.255
The other thing I wanted to talk about

00:03:34.255 --> 00:03:36.287
is that we're happy to announce that

00:03:36.287 --> 00:03:38.867
we're officially supported
through Google Cloud

00:03:38.867 --> 00:03:40.198
for this class.

00:03:40.198 --> 00:03:43.630
So, Google Cloud is somewhat
similar to Amazon AWS.

00:03:43.630 --> 00:03:46.694
You can go and start virtual
machines up in the cloud.

00:03:46.694 --> 00:03:50.432
These virtual machines can have GPUs.

00:03:50.432 --> 00:03:53.197
We're working on the tutorial
for exactly how to use

00:03:53.197 --> 00:03:55.386
Google Cloud and get it to
work for the assignments.

00:03:55.386 --> 00:03:58.517
But our intention is that
you'll be able to just download

00:03:58.517 --> 00:04:00.776
some image, and it'll be very seamless

00:04:00.776 --> 00:04:02.366
for you to work on the assignments

00:04:02.366 --> 00:04:04.723
on one of these instances on the cloud.

00:04:04.723 --> 00:04:07.086
And because Google has, very generously,

00:04:07.086 --> 00:04:08.437
supported this course,

00:04:08.437 --> 00:04:10.414
we'll be able to distribute to each of you

00:04:10.414 --> 00:04:14.122
coupons that let you use
Google Cloud credits for free

00:04:14.122 --> 00:04:15.582
for the class.

00:04:15.582 --> 00:04:18.571
So you can feel free to use
these for the assignments

00:04:18.571 --> 00:04:20.084
and also for the course projects

00:04:20.084 --> 00:04:22.607
when you want to start using
GPUs and larger machines

00:04:22.607 --> 00:04:24.015
and whatnot.

00:04:24.015 --> 00:04:25.976
So, we'll post more details about that,

00:04:25.976 --> 00:04:28.023
probably, on Piazza later today.

00:04:28.023 --> 00:04:29.300
But, I just wanted to mention,

00:04:29.300 --> 00:04:31.241
because I know there had
been a couple of questions

00:04:31.241 --> 00:04:33.369
about, can I use my laptop?

00:04:33.369 --> 00:04:34.389
Do I have to run on corn?

00:04:34.389 --> 00:04:35.566
Do I have to, whatever?

00:04:35.566 --> 00:04:37.607
And the answer is that,
you'll be able to run on

00:04:37.607 --> 00:04:41.774
Google Cloud and we'll provide
you some coupons for that.

00:04:43.923 --> 00:04:44.756
Yeah, so,

00:04:45.959 --> 00:04:47.818
those are, kind of, the
major administrative issues

00:04:47.818 --> 00:04:49.716
I wanted to talk about today.

00:04:49.716 --> 00:04:53.681
And then, let's dive into the content.

00:04:53.681 --> 00:04:55.463
So, the last lecture
we talked a little bit

00:04:55.463 --> 00:04:58.125
about this task of image classification,

00:04:58.125 --> 00:05:00.450
which is really a core
task in computer vision.

00:05:00.450 --> 00:05:02.444
And this is something
that we'll really focus on

00:05:02.444 --> 00:05:04.048
throughout the course of the class.

00:05:04.048 --> 00:05:04.964
That is, exactly,

00:05:04.964 --> 00:05:07.832
how do we work on this
image classification task?

00:05:07.832 --> 00:05:09.750
So, a little bit more concretely,

00:05:09.750 --> 00:05:12.142
when you're doing image classification,

00:05:12.142 --> 00:05:14.259
your system receives some input image,

00:05:14.259 --> 00:05:16.457
which is this cute cat in this example,

00:05:16.457 --> 00:05:20.007
and the system is aware
of some predetermined set

00:05:20.007 --> 00:05:22.455
of categories or labels.

00:05:22.455 --> 00:05:25.465
So, these might be, like,
a dog or a cat or a truck

00:05:25.465 --> 00:05:29.134
or a plane, and there's some
fixed set of category labels,

00:05:29.134 --> 00:05:31.292
and the job of the computer
is to look at the picture

00:05:31.292 --> 00:05:34.831
and assign it one of these
fixed category labels.

00:05:34.831 --> 00:05:36.444
This seems like a really easy problem,

00:05:36.444 --> 00:05:40.268
because so much of your own
visual system in your brain

00:05:40.268 --> 00:05:42.480
is hardwired to do these, sort of,

00:05:42.480 --> 00:05:44.346
visual recognition tasks.

00:05:44.346 --> 00:05:46.336
But this is actually a
really, really hard problem

00:05:46.336 --> 00:05:48.232
for a machine.

00:05:48.232 --> 00:05:50.306
So, if you dig in and
think about, actually,

00:05:50.306 --> 00:05:53.236
what does a computer see
when it looks at this image,

00:05:53.236 --> 00:05:55.852
it definitely doesn't get
this holistic idea of a cat

00:05:55.852 --> 00:05:57.428
that you see when you look at it.

00:05:57.428 --> 00:05:59.418
And the computer really
is representing the image

00:05:59.418 --> 00:06:01.755
as this gigantic grid of numbers.

00:06:01.755 --> 00:06:05.922
So, the image might be something
like 800 by 600 pixels.

00:06:07.371 --> 00:06:10.135
And each pixel is
represented by three numbers,

00:06:10.135 --> 00:06:13.176
giving the red, green, and
blue values for that pixel.

00:06:13.176 --> 00:06:14.106
So, to the computer,

00:06:14.106 --> 00:06:15.769
this is just a gigantic grid of numbers.

00:06:15.769 --> 00:06:18.569
And it's very difficult
to distill the cat-ness

00:06:18.569 --> 00:06:22.508
out of this, like, giant array
of thousands, or whatever,

00:06:22.508 --> 00:06:24.841
very many different numbers.
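
That "giant grid of numbers" can be made concrete in NumPy. This is an editorial sketch (random values stand in for a real photo) showing the 800 by 600, three-numbers-per-pixel representation just described:

```python
import numpy as np

# A color image, as the computer sees it: a height x width x 3 grid of
# integers in [0, 255], one red/green/blue triple per pixel.
# (Random values stand in for a real photo here.)
height, width = 600, 800
image = np.random.randint(0, 256, size=(height, width, 3), dtype=np.uint8)

print(image.shape)  # (600, 800, 3)
print(image[0, 0])  # the red, green, blue values of the top-left pixel
```

Everything the classifier does starts from this raw array; there is no cat anywhere in it, just these hundreds of thousands of numbers.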

00:06:26.956 --> 00:06:30.186
So, we refer to this
problem as the semantic gap.

00:06:30.186 --> 00:06:32.735
This idea of a cat, or
this label of a cat,

00:06:32.735 --> 00:06:35.462
is a semantic label that
we're assigning to this image,

00:06:35.462 --> 00:06:38.771
and there's this huge gap between
the semantic idea of a cat

00:06:38.771 --> 00:06:42.897
and these pixel values that the
computer is actually seeing.

00:06:42.897 --> 00:06:45.503
And this is a really hard problem because

00:06:45.503 --> 00:06:48.901
you can change the picture
in very small, subtle ways

00:06:48.901 --> 00:06:51.916
that will cause this pixel
grid to change entirely.

00:06:51.916 --> 00:06:54.038
So, for example, if we took this same cat,

00:06:54.038 --> 00:06:55.622
and if the cat happened to sit still

00:06:55.622 --> 00:06:57.352
and not even twitch, not move a muscle,

00:06:57.352 --> 00:06:58.782
which is never going to happen,

00:06:58.782 --> 00:07:01.077
but we moved the camera to the other side,

00:07:01.077 --> 00:07:04.050
then every single grid,
every single pixel,

00:07:04.050 --> 00:07:05.620
in this giant grid of numbers

00:07:05.620 --> 00:07:06.921
would be completely different.

00:07:06.921 --> 00:07:09.431
But, somehow, it's still
representing the same cat.

00:07:09.431 --> 00:07:12.754
And our algorithms need
to be robust to this.

00:07:12.754 --> 00:07:15.131
But viewpoint is only one problem,

00:07:15.131 --> 00:07:16.260
another is illumination.

00:07:16.260 --> 00:07:18.124
There can be different
lighting conditions going on

00:07:18.124 --> 00:07:19.298
in the scene.

00:07:19.298 --> 00:07:22.500
Whether the cat is appearing
in this very dark, moody scene,

00:07:22.500 --> 00:07:25.520
or, like, in this very bright,
sunlit scene, it's still a cat,

00:07:25.520 --> 00:07:28.488
and our algorithms need
to be robust to that.

00:07:28.488 --> 00:07:30.145
Objects can also deform.

00:07:30.145 --> 00:07:32.355
I think cats are, maybe,
among the more deformable

00:07:32.355 --> 00:07:34.549
of animals that you might see out there.

00:07:34.549 --> 00:07:37.825
And cats can really assume a
lot of different, varied poses

00:07:37.825 --> 00:07:38.940
and positions.

00:07:38.940 --> 00:07:40.774
And our algorithms should
be robust to these different

00:07:40.774 --> 00:07:42.441
kinds of transforms.

00:07:43.442 --> 00:07:45.823
There can also be problems of occlusion,

00:07:45.823 --> 00:07:49.762
where you might only see part
of a cat, like, just the face,

00:07:49.762 --> 00:07:51.865
or in this extreme example,
just a tail peeking out

00:07:51.865 --> 00:07:53.686
from under the couch cushion.

00:07:53.686 --> 00:07:56.626
But, in these cases, it's pretty
easy for you, as a person,

00:07:56.626 --> 00:07:58.897
to realize that this is probably a cat,

00:07:58.897 --> 00:08:01.532
and you still recognize
these images as cats.

00:08:01.532 --> 00:08:03.929
And this is something that our algorithms

00:08:03.929 --> 00:08:05.659
also must be robust to,

00:08:05.659 --> 00:08:08.460
which is quite difficult, I think.

00:08:08.460 --> 00:08:10.792
There can also be problems
of background clutter,

00:08:10.792 --> 00:08:13.329
where maybe the foreground
object of the cat,

00:08:13.329 --> 00:08:15.582
could actually look quite
similar in appearance

00:08:15.582 --> 00:08:16.677
to the background.

00:08:16.677 --> 00:08:20.389
And this is another thing
that we need to handle.

00:08:20.389 --> 00:08:23.637
There's also this problem
of intraclass variation,

00:08:23.637 --> 00:08:26.776
that this one notion of
cat-ness actually spans a lot of

00:08:26.776 --> 00:08:28.455
different visual appearances.

00:08:28.455 --> 00:08:30.340
And cats can come in
different shapes and sizes

00:08:30.340 --> 00:08:32.158
and colors and ages.

00:08:32.158 --> 00:08:34.154
And our algorithm, again, needs to work

00:08:34.154 --> 00:08:36.345
and handle all these different variations.

00:08:36.345 --> 00:08:40.033
So, this is actually a really,
really challenging problem.

00:08:40.033 --> 00:08:43.402
And it's sort of easy to
forget how hard this is

00:08:43.402 --> 00:08:46.264
because so much of your
brain is specifically tuned

00:08:46.264 --> 00:08:47.931
for dealing with these things.

00:08:47.931 --> 00:08:49.604
But now if we want our computer programs

00:08:49.604 --> 00:08:52.590
to deal with all of these
problems, all simultaneously,

00:08:52.590 --> 00:08:54.288
and not just for cats, by the way,

00:08:54.288 --> 00:08:56.832
but for just about any object
category you can imagine,

00:08:56.832 --> 00:08:59.052
this is a fantastically
challenging problem.

00:08:59.052 --> 00:09:00.686
And it's, actually, somewhat miraculous

00:09:00.686 --> 00:09:03.078
that this works at all, in my opinion.

00:09:03.078 --> 00:09:04.703
But, actually, not only does it work,

00:09:04.703 --> 00:09:07.285
but these things work very
close to human accuracy

00:09:07.285 --> 00:09:09.278
in some limited situations.

00:09:09.278 --> 00:09:12.379
And take only hundreds
of milliseconds to do so.

00:09:12.379 --> 00:09:14.921
So, this is some pretty
amazing, incredible technology,

00:09:14.921 --> 00:09:18.418
in my opinion, and over the
course of the rest of the class

00:09:18.418 --> 00:09:20.241
we will really see what
kinds of advancements

00:09:20.241 --> 00:09:22.241
have made this possible.

00:09:23.492 --> 00:09:24.728
So now, if you, kind of, think about

00:09:24.728 --> 00:09:27.565
what is the API for writing
an image classifier,

00:09:27.565 --> 00:09:30.098
you might sit down and try
to write a method in Python

00:09:30.098 --> 00:09:31.131
like this.

00:09:31.131 --> 00:09:32.739
Where you want to take in an image

00:09:32.739 --> 00:09:34.401
and then do some crazy magic

00:09:34.401 --> 00:09:36.185
and then, eventually,
spit out this class label

00:09:36.185 --> 00:09:38.180
to say cat or dog or whatnot.

00:09:38.180 --> 00:09:41.695
And there's really no obvious
way to do this, right?

00:09:41.695 --> 00:09:43.476
If you're taking an algorithms class

00:09:43.476 --> 00:09:44.981
and your task is to sort numbers

00:09:44.981 --> 00:09:46.837
or compute a convex hull

00:09:46.837 --> 00:09:49.535
or, even, do something
like RSA encryption,

00:09:49.535 --> 00:09:51.584
you, sort of, can write down an algorithm

00:09:51.584 --> 00:09:53.725
and enumerate all the
steps that need to happen

00:09:53.725 --> 00:09:55.773
in order for this thing to work.

00:09:55.773 --> 00:09:58.325
But, when we're trying
to recognize objects,

00:09:58.325 --> 00:10:00.390
or recognize cats or images,

00:10:00.390 --> 00:10:02.838
there's no really clear,
explicit algorithm

00:10:02.838 --> 00:10:04.810
that makes intuitive sense,

00:10:04.810 --> 00:10:07.909
for how you might go about
recognizing these objects.

00:10:07.909 --> 00:10:09.874
So, this is, again, quite challenging,

00:10:09.874 --> 00:10:12.163
if you think about,

00:10:12.163 --> 00:10:13.447
if it was your first day programming

00:10:13.447 --> 00:10:15.464
and you had to sit down
and write this function,

00:10:15.464 --> 00:10:18.869
I think most people would be in trouble.

00:10:18.869 --> 00:10:19.740
That being said,

00:10:19.740 --> 00:10:21.907
people have definitely
made explicit attempts

00:10:21.907 --> 00:10:25.029
to try to write, sort
of, hand-coded rules

00:10:25.029 --> 00:10:27.004
for recognizing different animals.

00:10:27.004 --> 00:10:29.193
So, we touched on this a
little bit in the last lecture,

00:10:29.193 --> 00:10:32.095
but maybe one idea for cats is that,

00:10:32.095 --> 00:10:35.596
we know that cats have ears
and eyes and mouths and noses.

00:10:35.596 --> 00:10:37.945
And we know that edges,
from Hubel and Wiesel,

00:10:37.945 --> 00:10:39.512
we know that edges are pretty important

00:10:39.512 --> 00:10:41.641
when it comes to visual recognition.

00:10:41.641 --> 00:10:43.820
So one thing we might try to do is

00:10:43.820 --> 00:10:45.425
compute the edges of this image

00:10:45.425 --> 00:10:47.543
and then go in and try to
categorize all the different

00:10:47.543 --> 00:10:50.983
corners and boundaries, and
say that, if we have maybe

00:10:50.983 --> 00:10:53.146
three lines meeting this way,
then it might be a corner,

00:10:53.146 --> 00:10:55.186
and an ear has one corner
here and one corner there

00:10:55.186 --> 00:10:56.483
and one corner there,

00:10:56.483 --> 00:10:59.076
and then, kind of, write down
this explicit set of rules

00:10:59.076 --> 00:11:01.608
for recognizing cats.

00:11:01.608 --> 00:11:04.391
But this turns out not to work very well.

00:11:04.391 --> 00:11:06.127
One, it's super brittle.

00:11:06.127 --> 00:11:09.013
And, two, say, if you want
to start over for another

00:11:09.013 --> 00:11:11.876
object category, and maybe
not worry about cats,

00:11:11.876 --> 00:11:15.005
but talk about trucks or dogs
or fishes or something else,

00:11:15.005 --> 00:11:17.081
then you need to start all over again.

00:11:17.081 --> 00:11:19.853
So, this is really not a
very scalable approach.

00:11:19.853 --> 00:11:22.579
We want to come up with some
algorithm, or some method,

00:11:22.579 --> 00:11:25.181
for these recognition tasks

00:11:25.181 --> 00:11:27.360
which scales much more
naturally to all the variety

00:11:27.360 --> 00:11:29.360
of objects in the world.

00:11:31.311 --> 00:11:34.766
So, the insight that, sort
of, makes this all work

00:11:34.766 --> 00:11:38.092
is this idea of the data-driven approach.

00:11:38.092 --> 00:11:41.046
Rather than sitting down and
writing these hand-specified

00:11:41.046 --> 00:11:44.340
rules to try to craft exactly
what is a cat or a fish

00:11:44.340 --> 00:11:45.766
or what have you,

00:11:45.766 --> 00:11:47.845
instead, we'll go out onto the internet

00:11:47.845 --> 00:11:51.313
and collect a large
dataset of many, many cats

00:11:51.313 --> 00:11:53.524
and many, many airplanes
and many, many deer

00:11:53.524 --> 00:11:55.402
and different things like this.

00:11:55.402 --> 00:11:58.075
And we can actually use tools
like Google Image Search,

00:11:58.075 --> 00:11:59.138
or something like that,

00:11:59.138 --> 00:12:01.546
to go out and collect a very
large number of examples

00:12:01.546 --> 00:12:03.866
of these different categories.

00:12:03.866 --> 00:12:06.203
By the way, this actually
takes quite a lot of effort

00:12:06.203 --> 00:12:08.338
to go out and actually
collect these datasets

00:12:08.338 --> 00:12:10.873
but, luckily, there's a lot
of really good, high quality

00:12:10.873 --> 00:12:14.105
datasets out there already for you to use.

00:12:14.105 --> 00:12:16.073
Then once we get this dataset,

00:12:16.073 --> 00:12:18.536
we train this machine learning classifier

00:12:18.536 --> 00:12:20.869
that is going to ingest all of the data,

00:12:20.869 --> 00:12:22.339
summarize it in some way,

00:12:22.339 --> 00:12:23.733
and then spit out a model

00:12:23.733 --> 00:12:26.130
that summarizes the
knowledge of how to recognize

00:12:26.130 --> 00:12:28.446
these different object categories.

00:12:28.446 --> 00:12:30.134
Then finally, we'll
use this trained model

00:12:30.134 --> 00:12:31.756
and apply it on new images

00:12:31.756 --> 00:12:33.601
that will then be able to recognize

00:12:33.601 --> 00:12:35.930
cats and dogs and whatnot.

00:12:35.930 --> 00:12:38.428
So here our API has changed a little bit.

00:12:38.428 --> 00:12:39.446
Rather than a single function

00:12:39.446 --> 00:12:42.099
that just inputs an image
and recognizes a cat,

00:12:42.099 --> 00:12:43.621
we have these two functions.

00:12:43.621 --> 00:12:48.084
One, called, train, that's
going to input images and labels

00:12:48.084 --> 00:12:49.115
and then output a model,

00:12:49.115 --> 00:12:52.030
and then, separately, another
function called, predict,

00:12:52.030 --> 00:12:54.013
which will input the model
and then make predictions

00:12:54.013 --> 00:12:55.276
for images.
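
The two-function API just described can be sketched in a few lines. This is an illustrative shape only: the function names match the lecture's description, but the "model" being a plain dict and the trivial placeholder prediction rule are editorial assumptions, not the course's starter code:

```python
# A minimal sketch of the train/predict API: train ingests images and
# labels and returns a model; predict takes the model and new images.

def train(images, labels):
    """Ingest training data and return some model summarizing it."""
    # Hypothetical model representation: just remember everything.
    return {"images": images, "labels": labels}

def predict(model, test_images):
    """Use the model to predict a label for each test image."""
    # Trivial placeholder rule: always predict the first training label.
    # A real classifier replaces this with something learned from the data.
    return [model["labels"][0] for _ in test_images]

model = train(["img0", "img1"], ["cat", "dog"])
print(predict(model, ["new_img"]))  # ['cat']
```

The important part is the split itself: all the learning happens in train, and predict only applies what was learned.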

00:12:55.276 --> 00:12:56.780
And this is, kind of, the key insight

00:12:56.780 --> 00:12:59.178
that allowed all these things
to start working really well

00:12:59.178 --> 00:13:01.928
over the last 10, 20 years or so.

00:13:05.784 --> 00:13:08.096
So, this class is primarily
about neural networks

00:13:08.096 --> 00:13:09.572
and convolutional neural networks

00:13:09.572 --> 00:13:11.111
and deep learning and all that,

00:13:11.111 --> 00:13:14.446
but this idea of a data-driven
approach is much more general

00:13:14.446 --> 00:13:15.819
than just deep learning.

00:13:15.819 --> 00:13:17.832
And I think it's useful to, sort of,

00:13:17.832 --> 00:13:19.145
step through this process

00:13:19.145 --> 00:13:20.924
for a very simple classifier first,

00:13:20.924 --> 00:13:23.058
before we get to these big, complex ones.

00:13:23.058 --> 00:13:26.898
So, probably, the simplest
classifier you can imagine

00:13:26.898 --> 00:13:28.907
is something we call nearest neighbor.

00:13:28.907 --> 00:13:31.243
The algorithm is pretty dumb, honestly.

00:13:31.243 --> 00:13:34.315
So, during the training
step we won't do anything,

00:13:34.315 --> 00:13:36.966
we'll just memorize all
of the training data.

00:13:36.966 --> 00:13:39.108
So this is very simple.

00:13:39.108 --> 00:13:41.006
And now, during the prediction step,

00:13:41.006 --> 00:13:43.191
we're going to take some new image

00:13:43.191 --> 00:13:45.597
and go and try to find
the most similar image

00:13:45.597 --> 00:13:47.964
in the training data to that new image,

00:13:47.964 --> 00:13:51.453
and now predict the label
of that most similar image.

00:13:51.453 --> 00:13:53.169
A very simple algorithm.

00:13:53.169 --> 00:13:55.104
But it, sort of, has a lot
of these nice properties

00:13:55.104 --> 00:13:58.771
with respect to
data-drivenness and whatnot.

00:13:59.977 --> 00:14:01.680
So, to be a little bit more concrete,

00:14:01.680 --> 00:14:04.951
you might imagine working on
this dataset called CIFAR-10,

00:14:04.951 --> 00:14:07.331
which is very commonly
used in machine learning,

00:14:07.331 --> 00:14:09.259
as kind of a small test case.

00:14:09.259 --> 00:14:11.579
And you'll be working with
this dataset on your homework.

00:14:11.579 --> 00:14:15.159
So, the CIFAR-10 dataset gives
you 10 different classes,

00:14:15.159 --> 00:14:17.965
airplanes and automobiles and
birds and cats and different

00:14:17.965 --> 00:14:19.920
things like that.

00:14:19.920 --> 00:14:22.073
And in total, across those 10 categories,

00:14:22.073 --> 00:14:24.990
it provides 50,000 training images,

00:14:27.173 --> 00:14:30.250
roughly evenly distributed
across these 10 categories.

00:14:30.250 --> 00:14:33.648
And then 10,000 additional testing images

00:14:33.648 --> 00:14:37.565
that you're supposed to
test your algorithm on.
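
For orientation, these are the array shapes you typically end up with once CIFAR-10 is loaded into NumPy. The zero arrays below are stand-ins for the real data, and the variable names are conventional assumptions; the actual loader is part of the assignment's starter code and isn't shown in the lecture:

```python
import numpy as np

# Typical CIFAR-10 shapes: 32x32 RGB images, labels in 0..9.
# (Dummy zero arrays stand in for the real dataset here.)
X_train = np.zeros((50000, 32, 32, 3), dtype=np.uint8)  # 50,000 training images
y_train = np.zeros(50000, dtype=np.int64)               # one label per image
X_test = np.zeros((10000, 32, 32, 3), dtype=np.uint8)   # 10,000 test images
y_test = np.zeros(10000, dtype=np.int64)

print(X_train.shape, X_test.shape)  # (50000, 32, 32, 3) (10000, 32, 32, 3)
```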

00:14:38.707 --> 00:14:41.265
So here's an example
of applying this simple

00:14:41.265 --> 00:14:44.003
nearest neighbor classifier
to some of these test images

00:14:44.003 --> 00:14:45.741
on CIFAR-10.

00:14:45.741 --> 00:14:48.703
So, on this grid on the right,

00:14:48.703 --> 00:14:50.863
the leftmost column

00:14:50.863 --> 00:14:53.693
gives a test image from
the CIFAR-10 dataset.

00:14:53.693 --> 00:14:58.375
And now on the right, we've
sorted the training images

00:14:58.375 --> 00:15:01.328
and show the most similar training images

00:15:01.328 --> 00:15:03.571
to each of these test examples.

00:15:03.571 --> 00:15:05.878
And you can see that they
look kind of visually similar

00:15:05.878 --> 00:15:07.938
to the training images,

00:15:07.938 --> 00:15:10.974
although they are not
always correct, right?

00:15:10.974 --> 00:15:13.407
So, maybe on the second row,
we see that the testing,

00:15:13.407 --> 00:15:14.687
this is kind of hard to see,

00:15:14.687 --> 00:15:17.340
because these images are 32 by 32 pixels,

00:15:17.340 --> 00:15:18.768
you need to really dive in there

00:15:18.768 --> 00:15:21.224
and try to make your best guess.

00:15:21.224 --> 00:15:23.850
But, this image is a dog and
its nearest neighbor is also

00:15:23.850 --> 00:15:28.072
a dog, but this next one,
I think is actually a deer

00:15:28.072 --> 00:15:30.006
or a horse or something else.

00:15:30.006 --> 00:15:33.237
But, you can see that it
looks quite visually similar,

00:15:33.237 --> 00:15:34.773
because there's kind of a
white blob in the middle

00:15:34.773 --> 00:15:36.370
and whatnot.

00:15:36.370 --> 00:15:38.630
So, if we're applying the
nearest neighbor algorithm

00:15:38.630 --> 00:15:39.651
to this image,

00:15:39.651 --> 00:15:42.707
we'll find the closest
example in the training set.

00:15:42.707 --> 00:15:45.107
And now, the closest
example, we know its label,

00:15:45.107 --> 00:15:47.135
because it comes from the training set.

00:15:47.135 --> 00:15:49.735
And now, we'll simply say that
this testing image is also

00:15:49.735 --> 00:15:50.875
a dog.

00:15:50.875 --> 00:15:53.946
You can see from these
examples that this is probably not

00:15:53.946 --> 00:15:55.851
going to work very well,

00:15:55.851 --> 00:16:00.018
but it's still kind of a
nice example to work through.

00:16:00.939 --> 00:16:03.724
But then, one detail
that we need to know is,

00:16:03.724 --> 00:16:05.013
given a pair of images,

00:16:05.013 --> 00:16:06.908
how can we actually compare them?

00:16:06.908 --> 00:16:09.152
Because, if we're going to take
our test image and compare it

00:16:09.152 --> 00:16:10.573
to all the training images,

00:16:10.573 --> 00:16:12.165
we actually have many different choices

00:16:12.165 --> 00:16:15.640
for exactly what that comparison
function should look like.

00:16:15.640 --> 00:16:17.687
So, in the example in the previous slide,

00:16:17.687 --> 00:16:20.008
we've used what's called the L1 distance,

00:16:20.008 --> 00:16:22.547
also sometimes called
the Manhattan distance.

00:16:22.547 --> 00:16:25.725
So, this is a really
sort of simple, easy idea

00:16:25.725 --> 00:16:27.448
for comparing images.

00:16:27.448 --> 00:16:31.691
And that's that we're going to
just compare individual pixels

00:16:31.691 --> 00:16:32.969
in these images.

00:16:32.969 --> 00:16:36.519
So, supposing that our test
image is maybe just a tiny

00:16:36.519 --> 00:16:39.346
four by four image of pixel values,

00:16:39.346 --> 00:16:41.802
then we'll take this upper-left-hand pixel

00:16:41.802 --> 00:16:43.172
of the test image,

00:16:43.172 --> 00:16:45.077
subtract off the value
in the training image,

00:16:45.077 --> 00:16:46.242
take the absolute value,

00:16:46.242 --> 00:16:49.068
and get the difference in that
pixel between the two images.

00:16:49.068 --> 00:16:50.916
And then, sum all these
up across all the pixels

00:16:50.916 --> 00:16:51.950
in the image.

00:16:51.950 --> 00:16:54.213
So, this is kind of a stupid
way to compare images,

00:16:54.213 --> 00:16:57.963
but it does some reasonable
things sometimes.

00:16:57.963 --> 00:16:59.535
But, this gives us a very concrete way

00:16:59.535 --> 00:17:01.991
to measure the difference
between two images.

00:17:01.991 --> 00:17:05.064
And in this case, we have
this difference of 456

00:17:05.064 --> 00:17:07.147
between these two images.
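
To make this concrete, here is a small NumPy sketch of the L1 computation just described. The two 4x4 pixel arrays are illustrative values, chosen so that their L1 difference comes out to the 456 mentioned above.

```python
import numpy as np

# Two illustrative 4x4 grayscale "images" of pixel values.
test_image = np.array([[56, 32, 10, 18],
                       [90, 23, 128, 133],
                       [24, 26, 178, 200],
                       [2, 0, 255, 220]])
train_image = np.array([[10, 20, 24, 17],
                        [8, 10, 89, 100],
                        [12, 16, 178, 170],
                        [4, 32, 233, 112]])

# L1 (Manhattan) distance: subtract pixel-wise, take absolute values,
# then sum the differences across all pixels.
l1_distance = np.sum(np.abs(test_image - train_image))
print(l1_distance)  # 456
```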

00:17:08.446 --> 00:17:10.944
So, here's some full Python code

00:17:10.944 --> 00:17:13.233
for implementing this
nearest neighbor classifier

00:17:13.233 --> 00:17:16.582
and you can see it's pretty
short and pretty concise

00:17:16.583 --> 00:17:18.804
because we've made use of
many of these vectorized

00:17:18.804 --> 00:17:21.348
operations offered by NumPy.

00:17:21.348 --> 00:17:25.078
So, here we can see that
this training function,

00:17:25.079 --> 00:17:26.247
that we talked about earlier,

00:17:26.247 --> 00:17:28.946
is, again, very simple, in
the case of nearest neighbor,

00:17:28.946 --> 00:17:30.573
you just memorize the training data,

00:17:30.573 --> 00:17:33.427
there's not really much to do here.

00:17:33.427 --> 00:17:35.869
And now, at test time, we're
going to take in our image

00:17:35.869 --> 00:17:39.126
and then go in and compare
using this L1 distance function,

00:17:39.126 --> 00:17:41.984
our test image to each of
these training examples

00:17:41.984 --> 00:17:45.395
and find the most similar
example in the training set.

00:17:45.395 --> 00:17:47.371
And you can see that, we're
actually able to do this

00:17:47.371 --> 00:17:50.113
in just one or two lines of Python code

00:17:50.113 --> 00:17:53.882
by utilizing these vectorized
operations in NumPy.

00:17:53.882 --> 00:17:55.779
So, this is something that
you'll get practice with

00:17:55.779 --> 00:17:57.779
on the first assignment.
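
A minimal sketch of what such a nearest neighbor classifier might look like (hypothetical names, not the assignment's actual starter code):

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" just memorizes the data: X is N x D, y holds N labels.
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        # For each test row, find the training row with the smallest
        # L1 distance and return that training row's label.
        y_pred = np.zeros(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            # One vectorized line computes distances to every training example.
            distances = np.sum(np.abs(self.X_train - X[i]), axis=1)
            y_pred[i] = self.y_train[np.argmin(distances)]
        return y_pred
```

The inner distance computation is the one-or-two-line vectorized NumPy step mentioned above; note that `predict` still loops over all N training examples per test image, which is why test time is slow.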

00:17:58.628 --> 00:18:02.179
So now, a couple questions
about this simple classifier.

00:18:02.179 --> 00:18:04.910
First, if we have N examples
in our training set,

00:18:04.910 --> 00:18:09.077
then how fast can we expect
training and testing to be?

00:18:12.233 --> 00:18:13.998
Well, training is probably constant

00:18:13.998 --> 00:18:15.712
because we don't really
need to do anything,

00:18:15.712 --> 00:18:17.878
we just need to memorize the data.

00:18:17.878 --> 00:18:19.241
And if you're just copying a pointer,

00:18:19.241 --> 00:18:20.663
that's going to be constant time

00:18:20.663 --> 00:18:22.703
no matter how big your dataset is.

00:18:22.703 --> 00:18:26.209
But now, at test time we need
to do this comparison step

00:18:26.209 --> 00:18:27.679
and compare our test image

00:18:27.679 --> 00:18:31.099
to each of the N training
examples in the dataset.

00:18:31.099 --> 00:18:33.766
And this is actually quite slow.

00:18:34.991 --> 00:18:37.448
So, this is actually somewhat backwards,

00:18:37.448 --> 00:18:38.641
if you think about it.

00:18:38.641 --> 00:18:40.350
Because, in practice,

00:18:40.350 --> 00:18:43.489
we want our classifiers to
be slow at training time

00:18:43.489 --> 00:18:45.326
and then fast at testing time.

00:18:45.326 --> 00:18:47.858
Because, you might imagine,
that a classifier might go

00:18:47.858 --> 00:18:49.882
and be trained in a data center somewhere

00:18:49.882 --> 00:18:52.045
and you can afford to
spend a lot of computation

00:18:52.045 --> 00:18:54.640
at training time to make
the classifier really good.

00:18:54.640 --> 00:18:55.473
But then,

00:18:55.473 --> 00:18:57.566
when you go and deploy the
classifier at test time,

00:18:57.566 --> 00:18:59.503
you want it to run on your mobile phone

00:18:59.503 --> 00:19:02.248
or in a browser or some
other low power device,

00:19:02.248 --> 00:19:04.493
and you really want the
testing time performance

00:19:04.493 --> 00:19:07.075
of your classifier to be quite fast.

00:19:07.075 --> 00:19:09.929
So, from this perspective, this
nearest neighbor algorithm,

00:19:09.929 --> 00:19:11.826
is, actually, a little bit backwards.

00:19:11.826 --> 00:19:13.591
And we'll see that once we move to

00:19:13.591 --> 00:19:14.720
convolutional neural networks,

00:19:14.720 --> 00:19:16.420
and other types of parametric models,

00:19:16.420 --> 00:19:18.286
they'll be the reverse of this.

00:19:18.286 --> 00:19:20.226
Where you'll spend a lot of
compute at training time,

00:19:20.226 --> 00:19:24.936
but then they'll be quite
fast at testing time.

00:19:24.936 --> 00:19:25.969
So then, the question is,

00:19:25.969 --> 00:19:28.139
what exactly does this
nearest neighbor algorithm

00:19:28.139 --> 00:19:30.816
look like when you apply it in practice?

00:19:30.816 --> 00:19:34.049
So, here we've drawn, what
we call the decision regions

00:19:34.049 --> 00:19:36.130
of a nearest neighbor classifier.

00:19:36.130 --> 00:19:39.636
So, here our training set
consists of these points

00:19:39.636 --> 00:19:42.021
in the two dimensional plane,

00:19:42.021 --> 00:19:45.388
where the color of the point
represents the category,

00:19:45.388 --> 00:19:47.547
or the class label, of that point.

00:19:47.547 --> 00:19:49.221
So, here we see we have five classes

00:19:49.221 --> 00:19:51.543
and some blue ones up in the corner here,

00:19:51.543 --> 00:19:53.921
some purple ones in the
upper-right hand corner.

00:19:53.921 --> 00:19:56.491
And now for each pixel
in this entire plane,

00:19:56.491 --> 00:20:00.860
we've gone and computed
what is the nearest example

00:20:00.860 --> 00:20:02.560
in the training data,

00:20:02.560 --> 00:20:04.391
and then colored the
point of the background

00:20:04.391 --> 00:20:06.954
corresponding to what is the class label.

00:20:06.954 --> 00:20:09.049
So, you can see that this
nearest neighbor classifier

00:20:09.049 --> 00:20:11.032
is just sort of carving up the space

00:20:11.032 --> 00:20:14.979
and coloring the space
according to the nearby points.

00:20:14.979 --> 00:20:18.320
But this classifier is maybe not so great.

00:20:18.320 --> 00:20:20.002
And by looking at this picture

00:20:20.002 --> 00:20:22.568
we can start to see some of the
problems that might come out

00:20:22.568 --> 00:20:24.676
with a nearest neighbor classifier.

00:20:24.676 --> 00:20:27.487
For one, this central
region actually contains

00:20:27.487 --> 00:20:29.208
mostly green points,

00:20:29.208 --> 00:20:31.591
but one little yellow point in the middle.

00:20:31.591 --> 00:20:34.406
But because we're just looking
at the nearest neighbor,

00:20:34.406 --> 00:20:36.411
this causes a little
yellow island to appear

00:20:36.411 --> 00:20:38.552
in the middle of this green cluster.

00:20:38.552 --> 00:20:40.087
And that's, maybe, not so great.

00:20:40.087 --> 00:20:44.081
Maybe those points actually
should have been green.

00:20:44.081 --> 00:20:47.990
And then, similarly we also
see these, sort of, fingers,

00:20:47.990 --> 00:20:50.225
like the green region
pushing into the blue region,

00:20:50.225 --> 00:20:52.450
again, due to the presence of one point,

00:20:52.450 --> 00:20:55.180
which may have been noisy or spurious.

00:20:55.180 --> 00:20:58.205
So, this kind of motivates
a slight generalization

00:20:58.205 --> 00:21:01.606
of this algorithm called
k-nearest neighbors.

00:21:01.606 --> 00:21:05.070
So rather than just looking for
the single nearest neighbor,

00:21:05.070 --> 00:21:08.021
instead we'll do something
a little bit fancier

00:21:08.021 --> 00:21:10.627
and find K of our nearest neighbors,

00:21:10.627 --> 00:21:12.240
according to our distance metric,

00:21:12.240 --> 00:21:15.057
and then take a vote among
each of our neighbors.

00:21:15.057 --> 00:21:16.708
And then predict the majority vote

00:21:16.708 --> 00:21:18.733
among our neighbors.

00:21:18.733 --> 00:21:21.032
You can imagine slightly more
complex ways of doing this.

00:21:21.032 --> 00:21:22.919
Maybe you'd vote weighted on the distance,

00:21:22.919 --> 00:21:24.148
or something like that,

00:21:24.148 --> 00:21:27.912
but the simplest thing that
tends to work pretty well

00:21:27.912 --> 00:21:29.810
is just taking a majority vote.
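
As a sketch, the find-K-neighbors-then-vote step just described could look like this (the function name and signature are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # L1 distances from the test point to every training point.
    distances = np.sum(np.abs(X_train - x_test), axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Simple (unweighted) majority vote among their labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

A distance-weighted vote, as mentioned, would just weight each neighbor's ballot by some decreasing function of its distance instead of counting them equally.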

00:21:29.810 --> 00:21:32.810
So here we've shown the
exact same set of points

00:21:32.810 --> 00:21:35.899
using this K=1 nearest
neighbor classifier,

00:21:35.899 --> 00:21:39.928
as well as K=3 and K=5 in
the middle and on the right.

00:21:39.928 --> 00:21:43.792
And once we move to K=3, you
can see that that spurious

00:21:43.792 --> 00:21:46.186
yellow point in the middle
of the green cluster

00:21:46.186 --> 00:21:49.369
is no longer causing the
points near that region

00:21:49.369 --> 00:21:50.966
to be classified as yellow.

00:21:50.966 --> 00:21:53.684
Now this entire green
portion in the middle

00:21:53.684 --> 00:21:55.852
is all being classified as green.

00:21:55.852 --> 00:21:57.737
You can also see that these fingers

00:21:57.737 --> 00:21:59.272
of the red and blue regions

00:21:59.272 --> 00:22:00.823
are starting to get smoothed out

00:22:00.823 --> 00:22:02.454
due to this majority voting.

00:22:02.454 --> 00:22:05.381
And then, once we move to the K=5 case,

00:22:05.381 --> 00:22:06.900
then these decision boundaries

00:22:06.900 --> 00:22:08.437
between the blue and red regions

00:22:08.437 --> 00:22:12.064
have become quite smooth and quite nice.

00:22:12.064 --> 00:22:13.818
So, generally when you're
using nearest neighbors

00:22:13.818 --> 00:22:14.727
classifiers,

00:22:14.727 --> 00:22:18.718
you almost always want
to use some value of K,

00:22:18.718 --> 00:22:20.771
which is larger than one

00:22:20.771 --> 00:22:22.897
because this tends to
smooth out your decision

00:22:22.897 --> 00:22:26.064
boundaries and lead to better results.

00:22:29.252 --> 00:22:30.159
Question?

00:22:30.159 --> 00:22:34.279
[student asking a question]

00:22:34.279 --> 00:22:35.208
Yes, so the question is,

00:22:35.208 --> 00:22:38.133
what is the deal with these white regions?

00:22:38.133 --> 00:22:41.025
The white regions are
where there was no majority

00:22:41.025 --> 00:22:43.247
among the k-nearest neighbors.

00:22:43.247 --> 00:22:45.521
You could imagine maybe doing
something slightly fancier

00:22:45.521 --> 00:22:48.972
and maybe taking a guess
or randomly selecting among

00:22:48.972 --> 00:22:50.472
the majority winners,

00:22:50.472 --> 00:22:52.729
but for this simple example
we're just coloring it white

00:22:52.729 --> 00:22:54.335
to indicate there was no majority

00:22:54.335 --> 00:22:55.668
among the neighbors at those points.

00:23:00.005 --> 00:23:02.439
Whenever we're thinking
about computer vision

00:23:02.439 --> 00:23:04.225
I think it's really useful to kind of flip

00:23:04.225 --> 00:23:06.616
back and forth between
several different viewpoints.

00:23:06.616 --> 00:23:09.480
One is this idea of high-dimensional
points, drawn here in the plane,

00:23:09.480 --> 00:23:13.049
and then the other is actually
looking at concrete images.

00:23:13.049 --> 00:23:15.118
Because the pixels of the image actually

00:23:15.118 --> 00:23:18.185
allow us to think of these
images as high dimensional

00:23:18.185 --> 00:23:19.048
vectors.

00:23:19.048 --> 00:23:21.181
And it's sort of useful to
ping pong back and forth

00:23:21.181 --> 00:23:23.395
between these two different viewpoints.

00:23:23.395 --> 00:23:26.193
So then, sort of taking
this k-nearest neighbor

00:23:26.193 --> 00:23:27.876
and going back to the images

00:23:27.876 --> 00:23:29.443
you can see that it's
actually not very good.

00:23:29.443 --> 00:23:31.216
Here I've colored in red and green

00:23:31.216 --> 00:23:33.265
which images would actually
be classified correctly

00:23:33.265 --> 00:23:35.385
or incorrectly according
to their nearest neighbor.

00:23:35.385 --> 00:23:38.288
And you can see that it's
really not very good.

00:23:38.288 --> 00:23:41.459
But maybe if we used a larger value of K

00:23:41.459 --> 00:23:43.690
then this would involve
actually voting among

00:23:43.690 --> 00:23:45.586
maybe the top three or the top five

00:23:45.586 --> 00:23:47.264
or maybe even the whole row.

00:23:47.264 --> 00:23:48.806
And you could imagine that
that would end up being

00:23:48.806 --> 00:23:52.158
a lot more robust to some
of this noise that we see

00:23:52.158 --> 00:23:55.325
when retrieving neighbors in this way.

00:23:57.070 --> 00:23:59.803
So another choice we
have when we're working

00:23:59.803 --> 00:24:01.666
with the k-nearest neighbor algorithm

00:24:01.666 --> 00:24:04.187
is determining exactly
how we should be comparing

00:24:04.187 --> 00:24:05.727
our different points.

00:24:05.727 --> 00:24:09.624
For the examples so far,

00:24:09.624 --> 00:24:11.666
we've talked about this L1 distance

00:24:11.666 --> 00:24:13.522
which takes the sum of the absolute values

00:24:13.522 --> 00:24:15.126
between the pixels.

00:24:15.126 --> 00:24:18.299
But another common choice is
the L2 or Euclidean distance

00:24:18.299 --> 00:24:21.266
where you take the square
root of the sum of the squares

00:24:21.266 --> 00:24:24.491
and take this as your distance.
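
Side by side, the two metrics on a pair of small vectors (illustrative values):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# L1 (Manhattan): sum of absolute differences.
l1 = np.sum(np.abs(a - b))          # |-1| + |-2| + |-3| = 6
# L2 (Euclidean): square root of the sum of squared differences.
l2 = np.sqrt(np.sum((a - b) ** 2))  # sqrt(1 + 4 + 9) = sqrt(14)
```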

00:24:24.491 --> 00:24:26.588
Choosing different
distance metrics actually

00:24:26.588 --> 00:24:28.727
is a pretty interesting topic

00:24:28.727 --> 00:24:30.072
because different distance metrics

00:24:30.072 --> 00:24:32.266
make different assumptions
about the underlying

00:24:32.266 --> 00:24:35.386
geometry or topology that
you'd expect in the space.

00:24:35.386 --> 00:24:39.103
So, for this L1 distance,
this shape here is actually

00:24:39.103 --> 00:24:41.688
a circle according to the L1 distance

00:24:41.688 --> 00:24:43.487
and it forms this square shape thing

00:24:43.487 --> 00:24:45.020
around the origin.

00:24:45.020 --> 00:24:47.637
Where each of the points
on this square

00:24:47.637 --> 00:24:51.017
is equidistant from the
origin according to L1,

00:24:51.017 --> 00:24:53.116
whereas with the L2 or Euclidean distance

00:24:53.116 --> 00:24:55.290
then this circle is a familiar circle,

00:24:55.290 --> 00:24:57.241
it looks like what you'd expect.

00:24:57.241 --> 00:24:59.521
So one interesting thing to
point out between these two

00:24:59.521 --> 00:25:00.798
metrics in particular,

00:25:00.798 --> 00:25:03.672
is that the L1 distance
depends on your choice

00:25:03.672 --> 00:25:05.135
of coordinate system.

00:25:05.135 --> 00:25:07.306
So if you were to rotate
the coordinate frame

00:25:07.306 --> 00:25:08.902
that would actually change the L1 distance

00:25:08.902 --> 00:25:10.201
between the points.

00:25:10.201 --> 00:25:13.287
Whereas changing the coordinate
frame in the L2 distance

00:25:13.287 --> 00:25:15.491
doesn't matter, it's the
same thing no matter what

00:25:15.491 --> 00:25:18.281
your coordinate frame is.

00:25:18.281 --> 00:25:21.721
Maybe if your input features,
if the individual entries

00:25:21.721 --> 00:25:23.866
in your vector have some important meaning

00:25:23.866 --> 00:25:24.791
for your task,

00:25:24.791 --> 00:25:27.935
then maybe somehow L1 might
be a more natural fit.

00:25:27.935 --> 00:25:30.533
But if it's just a generic
vector in some space

00:25:30.533 --> 00:25:32.373
and you don't know which
of the different elements,

00:25:32.373 --> 00:25:34.121
you don't know what they actually mean,

00:25:34.121 --> 00:25:37.531
then maybe L2 is slightly more natural.

00:25:37.531 --> 00:25:40.154
And another point here is that

00:25:40.154 --> 00:25:41.839
by using different distance metrics

00:25:41.839 --> 00:25:43.474
we can actually generalize
the k-nearest neighbor

00:25:43.474 --> 00:25:46.343
classifier to many, many
different types of data,

00:25:46.343 --> 00:25:48.280
not just vectors, not just images.

00:25:48.280 --> 00:25:51.410
So, for example, imagine you
wanted to classify pieces

00:25:51.410 --> 00:25:54.073
of text, then the only
thing you need to do

00:25:54.073 --> 00:25:55.366
to use k-nearest neighbors

00:25:55.366 --> 00:25:57.716
is to specify some distance function

00:25:57.716 --> 00:26:01.692
that can measure distances
between maybe two paragraphs

00:26:01.692 --> 00:26:03.831
or two sentences or something like that.

00:26:03.831 --> 00:26:06.908
So, simply by specifying
different distance metrics

00:26:06.908 --> 00:26:09.377
we can actually apply this
algorithm very generally

00:26:09.377 --> 00:26:12.701
to basically any type of data.

00:26:12.701 --> 00:26:14.598
Even though it's a kind
of simple algorithm,

00:26:14.598 --> 00:26:17.200
in general, it's a very
good thing to try first

00:26:17.200 --> 00:26:20.283
when you're looking at a new problem.

00:26:21.805 --> 00:26:23.986
So then, it's also kind of
interesting to think about

00:26:23.986 --> 00:26:25.854
what is actually happening geometrically

00:26:25.854 --> 00:26:28.441
if we choose different distance metrics.

00:26:28.441 --> 00:26:31.370
So here we see the same
set of points on the left

00:26:31.370 --> 00:26:33.577
using the L1, or Manhattan distance,

00:26:33.577 --> 00:26:36.767
and then, on the right,
using the familiar L2,

00:26:36.767 --> 00:26:38.087
or Euclidean distance.

00:26:38.087 --> 00:26:40.298
And you can see that the
shapes of these decision

00:26:40.298 --> 00:26:42.476
boundaries actually change quite a bit

00:26:42.476 --> 00:26:44.093
between the two metrics.

00:26:44.093 --> 00:26:46.792
So when you're looking at
L1 these decision boundaries

00:26:46.792 --> 00:26:49.337
tend to follow the coordinate axes.

00:26:49.337 --> 00:26:52.168
And this is again because
the L1 depends on our choice

00:26:52.168 --> 00:26:53.451
of coordinate system.

00:26:53.451 --> 00:26:56.020
Where the L2 sort of doesn't
really care about the

00:26:56.020 --> 00:26:57.544
coordinate axis, it
just puts the boundaries

00:26:57.544 --> 00:27:00.294
where they should fall naturally.

00:27:04.161 --> 00:27:06.406
My confession is that
each of these examples

00:27:06.406 --> 00:27:08.669
that I've shown you is
actually from this interactive

00:27:08.669 --> 00:27:10.293
web demo that I built,

00:27:10.293 --> 00:27:12.918
where you can go and play
with this k-nearest neighbor

00:27:12.918 --> 00:27:14.618
classifier on your own.

00:27:14.618 --> 00:27:17.938
And this is really hard to
work with on a projector screen.

00:27:17.938 --> 00:27:21.271
So maybe we'll do that on your own time.

00:27:26.820 --> 00:27:29.403
So, let's just go back to here.

00:27:32.951 --> 00:27:35.784
Man, this is kind of embarrassing.

00:28:07.103 --> 00:28:09.679
Okay, that was way more
trouble than it was worth.

00:28:09.679 --> 00:28:12.319
So, let's skip this, but I encourage you

00:28:12.319 --> 00:28:13.496
to go play with this in your browser.

00:28:13.496 --> 00:28:15.541
It's actually pretty fun

00:28:15.541 --> 00:28:17.742
and kind of nice to build intuition about

00:28:17.742 --> 00:28:19.246
how the decision boundary changes

00:28:19.246 --> 00:28:20.536
as you change the K

00:28:20.536 --> 00:28:23.529
and change your distance metric

00:28:23.529 --> 00:28:26.029
and all those sorts of things.

00:28:30.641 --> 00:28:32.387
Okay, so then the question is

00:28:32.387 --> 00:28:34.072
once you're actually trying
to use this algorithm

00:28:34.072 --> 00:28:36.146
in practice, there's several choices

00:28:36.146 --> 00:28:37.178
you need to make.

00:28:37.178 --> 00:28:39.426
We talked about choosing
different values of K.

00:28:39.426 --> 00:28:41.520
We talked about choosing
different distance metrics.

00:28:41.520 --> 00:28:43.403
And the question becomes

00:28:43.403 --> 00:28:45.584
how do you actually make
these choices for your problem

00:28:45.584 --> 00:28:47.286
and for your data?

00:28:47.286 --> 00:28:51.736
So, these choices, of things
like K and the distance metric,

00:28:51.736 --> 00:28:53.937
we call hyperparameters,

00:28:53.937 --> 00:28:56.456
because they are not necessarily
learned from the training

00:28:56.456 --> 00:28:57.289
data,

00:28:57.289 --> 00:29:00.267
instead these are choices about
your algorithm that you make

00:29:00.267 --> 00:29:01.313
ahead of time

00:29:01.313 --> 00:29:05.623
and there's no way to learn
them directly from the data.

00:29:05.623 --> 00:29:08.821
So, the question is how
do you set these things

00:29:08.821 --> 00:29:10.260
in practice?

00:29:10.260 --> 00:29:12.277
And they turn out to be
very problem-dependent.

00:29:12.277 --> 00:29:15.309
And the simple thing that
most people do is simply

00:29:15.309 --> 00:29:17.957
try different values of
hyperparameters for your data

00:29:17.957 --> 00:29:20.950
and for your problem, and
figure out which one works best.

00:29:20.950 --> 00:29:22.404
There's a question?

00:29:22.404 --> 00:29:26.071
[student asking a question]

00:29:29.589 --> 00:29:32.367
So, the question is, where L1
distance might be preferable

00:29:32.367 --> 00:29:34.447
to using L2 distance?

00:29:34.447 --> 00:29:36.800
I think it's mainly problem-dependent,

00:29:36.800 --> 00:29:38.106
it's sort of difficult to say

00:29:38.106 --> 00:29:40.371
in which cases you think
one might be better

00:29:40.371 --> 00:29:41.204
than the other.

00:29:41.204 --> 00:29:44.719
but I think that because L1
has this sort of coordinate

00:29:44.719 --> 00:29:48.877
dependency, it actually depends
on the coordinate system

00:29:48.877 --> 00:29:50.185
of your data,

00:29:50.185 --> 00:29:52.833
if you know that you have a vector,

00:29:52.833 --> 00:29:54.482
and maybe the individual
elements of the vector

00:29:54.482 --> 00:29:55.513
have meaning.

00:29:55.513 --> 00:29:58.583
Like maybe you're classifying
employees for some reason

00:29:58.583 --> 00:30:00.713
and then the different elements
of that vector correspond

00:30:00.713 --> 00:30:03.976
to different features or
aspects of an employee.

00:30:03.976 --> 00:30:06.432
Like their salary or the
number of years they've been

00:30:06.432 --> 00:30:08.778
working at the company
or something like that.

00:30:08.778 --> 00:30:10.949
So I think when your
individual elements actually

00:30:10.949 --> 00:30:11.860
have some meaning,

00:30:11.860 --> 00:30:14.297
is where I think maybe using
L1 might make a little bit

00:30:14.297 --> 00:30:15.850
more sense.

00:30:15.850 --> 00:30:17.623
But in general, again,
this is a hyperparameter

00:30:17.623 --> 00:30:19.989
and it really depends on
your problem and your data

00:30:19.989 --> 00:30:22.214
so the best answer is
just to try them both

00:30:22.214 --> 00:30:24.381
and see what works better.

00:30:28.381 --> 00:30:30.092
Even this idea of trying
out different values

00:30:30.092 --> 00:30:32.413
of hyperparameters and
seeing what works best,

00:30:32.413 --> 00:30:34.238
there are many different choices here.

00:30:34.238 --> 00:30:36.287
What exactly does it mean
to try hyperparameters

00:30:36.287 --> 00:30:38.268
and see what works best?

00:30:38.268 --> 00:30:40.391
Well, the first idea you might think of

00:30:40.391 --> 00:30:42.911
is simply choosing the
hyperparameters that give you

00:30:42.911 --> 00:30:45.032
the best accuracy or best performance

00:30:45.032 --> 00:30:47.691
on your training data.

00:30:47.691 --> 00:30:49.961
This is actually a really terrible idea.

00:30:49.961 --> 00:30:52.137
You should never do this.

00:30:52.137 --> 00:30:54.081
In the concrete case
of the nearest neighbor

00:30:54.081 --> 00:30:55.717
classifier, for example,

00:30:55.717 --> 00:30:59.324
if we set K=1, we will always
classify the training data

00:30:59.324 --> 00:31:00.157
perfectly.

00:31:01.124 --> 00:31:04.420
So if we use this strategy
we'll always pick K=1,

00:31:04.420 --> 00:31:06.390
but, as we saw from the examples earlier,

00:31:06.390 --> 00:31:10.446
in practice it seems that
setting K to larger values

00:31:10.446 --> 00:31:13.184
might cause us to misclassify
some of the training data,

00:31:13.184 --> 00:31:14.713
but, in fact, lead to better performance

00:31:14.713 --> 00:31:17.915
on points that were not
in the training data.

00:31:17.915 --> 00:31:19.208
And ultimately in machine learning

00:31:19.208 --> 00:31:21.434
we don't care about
fitting the training data,

00:31:21.434 --> 00:31:23.317
we really care about how our classifier,

00:31:23.317 --> 00:31:24.463
or how our method,

00:31:24.463 --> 00:31:27.051
will perform on unseen
data after training.

00:31:27.051 --> 00:31:30.495
So, this is a terrible
idea, don't do this.

00:31:30.495 --> 00:31:32.687
So, another idea that you might think of,

00:31:32.687 --> 00:31:34.856
is maybe we'll take our full dataset

00:31:34.856 --> 00:31:36.532
and we'll split it into some training data

00:31:36.532 --> 00:31:38.390
and some test data.

00:31:38.390 --> 00:31:42.294
And now I'll try training
my algorithm with different

00:31:42.294 --> 00:31:44.409
choices of hyperparameters
on the training data

00:31:44.409 --> 00:31:47.263
and then I'll go and apply
that trained classifier

00:31:47.263 --> 00:31:49.584
on the test data and now I will pick

00:31:49.584 --> 00:31:52.299
the set of hyperparameters
that cause me to perform best

00:31:52.299 --> 00:31:53.716
on the test data.

00:31:54.582 --> 00:31:57.414
This seems like maybe a
more reasonable strategy,

00:31:57.414 --> 00:31:59.546
but, in fact, this is also a terrible idea

00:31:59.546 --> 00:32:01.087
and you should never do this.

00:32:01.087 --> 00:32:03.723
Because, again, the point
of machine learning systems

00:32:03.723 --> 00:32:06.515
is that we want to know how
our algorithm will perform.

00:32:06.515 --> 00:32:08.017
So, the point of the test set

00:32:08.017 --> 00:32:10.760
is to give us some estimate
of how our method will do

00:32:10.760 --> 00:32:14.523
on unseen data that's
coming out from the wild.

00:32:14.523 --> 00:32:17.710
And if we use this strategy
of training many different

00:32:17.710 --> 00:32:19.749
algorithms with different hyperparameters,

00:32:19.749 --> 00:32:22.133
and then, selecting the
one which does the best

00:32:22.133 --> 00:32:23.363
on the test data,

00:32:23.363 --> 00:32:26.404
then, it's possible, that
we may have just picked

00:32:26.404 --> 00:32:28.045
the right set of hyperparameters

00:32:28.045 --> 00:32:30.319
that caused our algorithm
to work quite well

00:32:30.319 --> 00:32:31.663
on this testing set,

00:32:31.663 --> 00:32:33.776
but now our performance on this test set

00:32:33.776 --> 00:32:35.488
will no longer be representative

00:32:35.488 --> 00:32:38.280
of our performance of new, unseen data.

00:32:38.280 --> 00:32:41.491
So, again, you should not
do this, this is a bad idea,

00:32:41.491 --> 00:32:44.672
you'll get in trouble if you do this.

00:32:44.672 --> 00:32:47.025
What is much more common, is
to actually split your data

00:32:47.025 --> 00:32:49.192
into three different sets.

00:32:50.185 --> 00:32:54.165
You'll partition most of
your data into a training set

00:32:54.165 --> 00:32:56.159
and then you'll create a validation set

00:32:56.159 --> 00:32:57.305
and a test set.

00:32:57.305 --> 00:33:01.465
And now what we typically do
is go and train our algorithm

00:33:01.465 --> 00:33:03.500
with many different
choices of hyperparameters

00:33:03.500 --> 00:33:05.172
on the training set,

00:33:05.172 --> 00:33:07.227
evaluate on the validation set,

00:33:07.227 --> 00:33:08.802
and now pick the set of hyperparameters

00:33:08.802 --> 00:33:11.621
which performs best on the validation set.

00:33:11.621 --> 00:33:13.719
And now, after you've
done all your development,

00:33:13.719 --> 00:33:14.899
you've done all your debugging,

00:33:14.899 --> 00:33:16.627
after you've done everything,

00:33:16.627 --> 00:33:20.000
then you'd take that best
performing classifier

00:33:20.000 --> 00:33:21.614
on the validation set

00:33:21.614 --> 00:33:23.401
and run it once on the test set.
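
A hedged sketch of that search over K for a k-nearest neighbor classifier (all names here are hypothetical, and the held-out test set is never touched during the search):

```python
import numpy as np

def choose_k(X_train, y_train, X_val, y_val, candidate_ks=(1, 3, 5, 7)):
    # Try each candidate K, score it on the validation set only,
    # and keep the K with the best validation accuracy.
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = 0
        for x, y in zip(X_val, y_val):
            # L1 distances from this validation point to all training points.
            distances = np.sum(np.abs(X_train - x), axis=1)
            nearest = np.argsort(distances)[:k]
            # Majority vote among the k nearest training labels.
            values, counts = np.unique(y_train[nearest], return_counts=True)
            if values[np.argmax(counts)] == y:
                correct += 1
        acc = correct / len(y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```

Only after this loop is finished would the winning classifier be run once on the test set to produce the final reported number.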

00:33:23.401 --> 00:33:25.005
And now that's the number
that goes into your paper,

00:33:25.005 --> 00:33:27.139
that's the number that
goes into your report,

00:33:27.139 --> 00:33:29.542
that's the number that
actually is telling you how

00:33:29.542 --> 00:33:32.393
your algorithm is doing on unseen data.

00:33:32.393 --> 00:33:34.335
And this is actually
really, really important

00:33:34.335 --> 00:33:36.516
that you keep a very
strict separation between

00:33:36.516 --> 00:33:38.847
the validation data and the test data.

00:33:38.847 --> 00:33:41.437
So, for example, when we're
working on research papers,

00:33:41.437 --> 00:33:43.303
we typically only touch the test set

00:33:43.303 --> 00:33:45.653
at the very last minute.

00:33:45.653 --> 00:33:47.029
So, when I'm writing papers,

00:33:47.029 --> 00:33:49.472
I tend to only touch the
test set for my problem

00:33:49.472 --> 00:33:51.809
in maybe the week before
the deadline or so

00:33:51.809 --> 00:33:54.159
to really insure that we're not

00:33:54.159 --> 00:33:56.665
being dishonest here and
we're not reporting a number

00:33:56.665 --> 00:33:57.961
which is unfair.

00:33:57.961 --> 00:34:00.102
So, this is actually super important

00:34:00.102 --> 00:34:02.216
and you want to make sure
to keep your test data

00:34:02.216 --> 00:34:03.883
quite under control.

00:34:06.468 --> 00:34:08.721
So another strategy for
setting hyperparameters

00:34:08.721 --> 00:34:10.840
is called cross validation.

00:34:10.840 --> 00:34:13.868
And this is used a
little bit more commonly

00:34:13.868 --> 00:34:17.317
for small data sets, not used
so much in deep learning.

00:34:17.317 --> 00:34:20.201
So here the idea is we're
going to take our dataset,

00:34:20.201 --> 00:34:21.674
and,

00:34:21.674 --> 00:34:25.534
as usual, hold out some test
set to use at the very end,

00:34:25.534 --> 00:34:27.023
and now, for the rest of the data,

00:34:27.023 --> 00:34:29.188
rather than splitting it
into a single training

00:34:29.188 --> 00:34:31.265
and validation partition,

00:34:31.266 --> 00:34:33.851
instead, we can split our training data

00:34:33.851 --> 00:34:35.515
into many different folds.

00:34:35.516 --> 00:34:38.879
And now, in this way, we
cycle through choosing which

00:34:38.879 --> 00:34:41.415
fold is going to be the validation set.

00:34:41.415 --> 00:34:43.286
So now, in this example,

00:34:43.286 --> 00:34:45.498
we're using five fold cross validation,

00:34:45.498 --> 00:34:47.594
so you would train your
algorithm with one set of

00:34:47.594 --> 00:34:49.984
hyperparameters on the first four folds,

00:34:49.985 --> 00:34:51.928
evaluate the performance on fold five,

00:34:51.928 --> 00:34:54.177
and now go and retrain
your algorithm on folds

00:34:54.177 --> 00:34:55.712
one, two, three, and five,

00:34:55.712 --> 00:34:57.769
evaluate on fold four,

00:34:57.769 --> 00:35:00.293
and cycle through all the different folds.

00:35:00.293 --> 00:35:01.765
And, when you do it this way,

00:35:01.765 --> 00:35:04.098
you get much higher confidence about

00:35:04.098 --> 00:35:06.215
which hyperparameters are going to perform

00:35:06.215 --> 00:35:07.511
more robustly.
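The fold-cycling procedure just described can be sketched with NumPy; the two-blob toy dataset and the L1-distance kNN below are illustrative stand-ins, not the assignment's actual code.

```python
import numpy as np

def cross_validate(X, y, k_choices, num_folds=5):
    """For each candidate k, cycle through which fold plays validation
    set, 'train' kNN on the remaining folds, and record accuracy."""
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    accuracies = {k: [] for k in k_choices}
    for k in k_choices:
        for i in range(num_folds):
            X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            X_val, y_val = X_folds[i], y_folds[i]
            # L1 distance from every validation point to every training point
            dists = np.abs(X_val[:, None, :] - X_tr[None, :, :]).sum(axis=2)
            # labels of the k nearest neighbors, then majority vote
            nearest = y_tr[np.argsort(dists, axis=1)[:, :k]]
            preds = np.array([np.bincount(row).argmax() for row in nearest])
            accuracies[k].append(float(np.mean(preds == y_val)))
    return accuracies

# toy data: two well-separated 2-D blobs, shuffled before folding
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y = np.concatenate([np.zeros(50, int), np.ones(50, int)])
perm = rng.permutation(len(X))
accs = cross_validate(X[perm], y[perm], k_choices=[1, 3, 5])
```

Each entry of `accs[k]` is one fold's validation accuracy, which is exactly the per-k spread of points you'd plot.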

00:35:07.511 --> 00:35:09.714
So this is kind of the
gold standard to use,

00:35:09.714 --> 00:35:11.749
but, in practice in deep learning

00:35:11.749 --> 00:35:13.285
when we're training large models

00:35:13.285 --> 00:35:15.972
and training is very
computationally expensive,

00:35:15.972 --> 00:35:18.652
this doesn't get used
too much in practice.

00:35:18.652 --> 00:35:19.519
Question?

00:35:19.519 --> 00:35:23.186
[student asking a question]

00:35:29.515 --> 00:35:31.371
Yeah, so the question is,

00:35:31.371 --> 00:35:32.634
a little bit more concretely,

00:35:32.634 --> 00:35:34.320
what's the difference
between the training and the

00:35:34.320 --> 00:35:35.728
validation set?

00:35:35.728 --> 00:35:39.895
So, if you think about the
k-nearest neighbor classifier

00:35:41.060 --> 00:35:45.300
then the training set is this
set of images with labels

00:35:45.300 --> 00:35:46.974
where we memorize the labels.

00:35:46.974 --> 00:35:48.607
And now, to classify an image,

00:35:48.607 --> 00:35:51.405
we're going to take the image
and compare it to each element

00:35:51.405 --> 00:35:52.451
in the training data,

00:35:52.451 --> 00:35:56.618
and then transfer the label
from the nearest training point.

00:36:00.052 --> 00:36:02.102
So now our algorithm
will memorize everything

00:36:02.102 --> 00:36:03.184
in the training set,

00:36:03.184 --> 00:36:05.934
and now we'll take each
element of the validation set

00:36:05.934 --> 00:36:08.062
and compare it to each
element in the training data

00:36:08.062 --> 00:36:12.081
and then use this to
determine what is the accuracy

00:36:12.081 --> 00:36:16.815
of our classifier when it's
applied on the validation set.

00:36:16.815 --> 00:36:18.746
So this is the distinction
between training

00:36:18.746 --> 00:36:19.919
and validation.

00:36:19.919 --> 00:36:22.364
Where your algorithm is
able to see the labels

00:36:22.364 --> 00:36:24.207
of the training set,

00:36:24.207 --> 00:36:25.528
but for the validation set,

00:36:25.528 --> 00:36:28.473
your algorithm doesn't have
direct access to the labels.

00:36:28.473 --> 00:36:30.818
We only use the labels
of the validation set

00:36:30.818 --> 00:36:34.043
to check how well our algorithm is doing.
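The distinction just described can be made concrete with a minimal nearest-neighbor sketch (1-NN for brevity, with made-up data): training labels are memorized, while validation labels are only consulted at the end to compute an accuracy.

```python
import numpy as np

class NearestNeighbor:
    """1-NN for brevity: 'training' just memorizes images and labels."""
    def train(self, X, y):
        self.X_tr, self.y_tr = X, y           # the labels we get to see
    def predict(self, X):
        # compare each query to every memorized training point (L1 distance)
        dists = np.abs(X[:, None, :] - self.X_tr[None, :, :]).sum(axis=2)
        return self.y_tr[dists.argmin(axis=1)]

rng = np.random.default_rng(1)
X_train = rng.normal(size=(20, 4))
y_train = (X_train[:, 0] > 0).astype(int)     # made-up labels
X_val = rng.normal(size=(10, 4))
y_val = (X_val[:, 0] > 0).astype(int)

clf = NearestNeighbor()
clf.train(X_train, y_train)                   # sees training labels
preds = clf.predict(X_val)                    # no validation labels used here
val_accuracy = float(np.mean(preds == y_val)) # labels only used to score
```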

00:36:34.043 --> 00:36:34.907
A question?

00:36:34.907 --> 00:36:38.574
[student asking a question]

00:36:44.373 --> 00:36:47.736
The question is, whether the test set,

00:36:47.736 --> 00:36:49.442
is it possible that the
test set might not be

00:36:49.442 --> 00:36:52.941
representative of data
out there in the wild?

00:36:52.941 --> 00:36:55.955
This definitely can be
a problem in practice,

00:36:55.955 --> 00:36:58.252
the underlying statistical
assumption here is that

00:36:58.252 --> 00:37:01.863
your data are all independently
and identically distributed,

00:37:01.863 --> 00:37:05.280
so that all of your data points should be

00:37:06.631 --> 00:37:10.125
drawn from the same underlying
probability distribution.

00:37:10.125 --> 00:37:12.948
Of course, in practice, this
might not always be the case,

00:37:12.948 --> 00:37:14.768
and you definitely can run into cases

00:37:14.768 --> 00:37:18.798
where the test set might
not be super representative

00:37:18.798 --> 00:37:20.951
of what you see in the wild.

00:37:20.951 --> 00:37:23.826
So this is kind of a problem
that dataset creators and

00:37:23.826 --> 00:37:25.473
dataset curators need to think about.

00:37:25.473 --> 00:37:27.712
But when I'm creating
datasets, for example,

00:37:27.712 --> 00:37:28.626
one thing I do,

00:37:28.626 --> 00:37:31.454
is I'll go and collect a whole
bunch of data all at once,

00:37:31.454 --> 00:37:34.252
using the exact same methodology
for collecting the data,

00:37:34.252 --> 00:37:36.656
and then afterwards you go
and partition it randomly

00:37:36.656 --> 00:37:38.690
between train and test.

00:37:38.690 --> 00:37:40.962
One thing that can screw you up here is

00:37:40.962 --> 00:37:43.024
maybe if you're collecting data over time

00:37:43.024 --> 00:37:45.489
and you make the earlier
data, that you collect first,

00:37:45.489 --> 00:37:46.343
be the training data,

00:37:46.343 --> 00:37:48.601
and the later data that you
collect be the test data,

00:37:48.601 --> 00:37:50.300
then you actually might
run into this shift

00:37:50.300 --> 00:37:51.613
that could cause problems.

00:37:51.613 --> 00:37:53.971
But as long as this partition is random

00:37:53.971 --> 00:37:55.831
among your entire set of data points,

00:37:55.831 --> 00:37:58.762
then that's how we try
to alleviate this problem

00:37:58.762 --> 00:37:59.762
in practice.
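A minimal sketch of the random-partition idea just described, assuming nothing beyond NumPy; the fraction and seed are arbitrary choices.

```python
import numpy as np

def random_split(X, y, test_fraction=0.2, seed=0):
    """Partition randomly, so train and test come from the same
    distribution -- rather than, say, chronologically, which risks
    the shift described above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))             # shuffle all indices
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

X = np.arange(100).reshape(50, 2)             # 50 toy data points
y = np.arange(50)
X_tr, y_tr, X_te, y_te = random_split(X, y)
```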

00:38:04.297 --> 00:38:06.619
So then, once you've gone through this

00:38:06.619 --> 00:38:08.351
cross validation procedure,

00:38:08.351 --> 00:38:11.493
then you end up with graphs
that look something like this.

00:38:11.493 --> 00:38:14.856
So here, on the X axis, we
are showing the value of K

00:38:14.856 --> 00:38:17.354
for a k-nearest neighbor
classifier on some problem,

00:38:17.354 --> 00:38:21.565
and now on the Y axis, we are
showing what is the accuracy

00:38:21.565 --> 00:38:25.102
of our classifier on some dataset

00:38:25.102 --> 00:38:26.961
for different values of K.

00:38:26.961 --> 00:38:28.332
And you can see that, in this case,

00:38:28.332 --> 00:38:31.705
we've done five fold cross
validation over the data,

00:38:31.705 --> 00:38:34.639
so, for each value of K we
have five different examples

00:38:34.639 --> 00:38:38.346
of how well this algorithm is doing.

00:38:38.346 --> 00:38:41.050
And, actually, going back
to the question about

00:38:41.050 --> 00:38:43.165
having some test sets
that are better or worse

00:38:43.165 --> 00:38:44.748
for your algorithm,

00:38:46.239 --> 00:38:47.683
using K fold cross validation

00:38:47.683 --> 00:38:50.276
is maybe one way to help
quantify that a little bit.

00:38:50.276 --> 00:38:54.935
And, in that, we can see the
variance of how this algorithm

00:38:54.935 --> 00:38:57.606
performs on different
of the validation folds.

00:38:57.606 --> 00:38:58.981
And that gives you some sense of,

00:38:58.981 --> 00:39:00.359
not just what is the best,

00:39:00.359 --> 00:39:04.301
but, also, what is the
distribution of that performance.

00:39:04.301 --> 00:39:06.442
So, whenever you're training
machine learning models

00:39:06.442 --> 00:39:08.151
you end up making plots like this,

00:39:08.151 --> 00:39:09.744
where they show you what is your accuracy,

00:39:09.744 --> 00:39:12.380
or your performance as a
function of your hyperparameters,

00:39:12.380 --> 00:39:14.857
and then you want to
go and pick the model,

00:39:14.857 --> 00:39:16.151
or the set of hyperparameters,

00:39:16.151 --> 00:39:17.198
at the end of the day,

00:39:17.198 --> 00:39:19.995
that performs the best
on the validation set.

00:39:19.995 --> 00:39:23.195
So, here we see that maybe
K=7 probably works

00:39:23.195 --> 00:39:25.528
about best for this problem.
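Picking a hyperparameter from such a plot amounts to choosing the k with the highest mean validation accuracy, with the per-fold spread giving the robustness. The per-fold numbers below are hypothetical, chosen only to mirror the k=7 outcome mentioned above.

```python
import numpy as np

# hypothetical per-fold validation accuracies from 5-fold CV,
# one list per candidate k (not real CIFAR-10 numbers)
accs = {1:  [0.52, 0.55, 0.50, 0.54, 0.53],
        5:  [0.58, 0.60, 0.57, 0.61, 0.59],
        7:  [0.60, 0.62, 0.59, 0.63, 0.61],
        10: [0.57, 0.58, 0.56, 0.59, 0.58]}

# the mean tells you the best setting; the std tells you how robust it is
means = {k: float(np.mean(v)) for k, v in accs.items()}
stds = {k: float(np.std(v)) for k, v in accs.items()}
best_k = max(means, key=means.get)
```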

00:39:29.713 --> 00:39:32.007
So, k-nearest neighbor
classifiers on images

00:39:32.007 --> 00:39:34.458
are actually almost
never used in practice.

00:39:34.458 --> 00:39:38.233
Because of all of these
problems that we've talked about.

00:39:38.233 --> 00:39:41.393
So, one problem is that
it's very slow at test time,

00:39:41.393 --> 00:39:43.115
which is the reverse of what we want,

00:39:43.115 --> 00:39:44.287
which we talked about earlier.

00:39:44.287 --> 00:39:45.582
Another problem is that

00:39:45.582 --> 00:39:48.859
these things like Euclidean
distance, or L1 distance,

00:39:48.859 --> 00:39:51.648
are really not a very good way

00:39:51.648 --> 00:39:54.495
to measure distances between images.

00:39:54.495 --> 00:39:57.247
These, sort of, vectorial
distance functions

00:39:57.247 --> 00:39:59.857
do not correspond very well
to perceptual similarity

00:39:59.857 --> 00:40:01.254
between images.

00:40:01.254 --> 00:40:04.076
How you perceive
differences between images.

00:40:04.076 --> 00:40:06.910
So, in this example, we've constructed,

00:40:06.910 --> 00:40:08.796
there's this image on the left of a girl,

00:40:08.796 --> 00:40:11.234
and then three different
distorted images on the right

00:40:11.234 --> 00:40:12.930
where we've blocked out her mouth,

00:40:12.930 --> 00:40:16.029
we've actually shifted
down by a couple pixels,

00:40:16.029 --> 00:40:18.416
or tinted the entire image blue.

00:40:18.416 --> 00:40:20.849
And, actually, if you compute
the Euclidean distance

00:40:20.849 --> 00:40:22.785
between the original and the boxed,

00:40:22.785 --> 00:40:24.145
the original and the shifted,

00:40:24.145 --> 00:40:25.425
and the original and the tinted,

00:40:25.425 --> 00:40:27.735
they all have the same L2 distance.

00:40:27.735 --> 00:40:29.469
Which is, maybe, not so good

00:40:29.469 --> 00:40:32.585
because it sort of
gives you the sense that

00:40:32.585 --> 00:40:34.952
the L2 distance is really
not doing a very good job

00:40:34.952 --> 00:40:39.119
at capturing these perceptual
distances between images.
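The same effect is easy to construct on tiny synthetic images (these 4x4 arrays are illustrative, not the actual photos on the slide): three very different edits, all at exactly the same L2 distance from the original.

```python
import numpy as np

orig = np.zeros((4, 4))                # a tiny stand-in "image"

# three very different edits, built so the total squared change matches
boxed = orig.copy();   boxed[1:3, 1:3] = 2.0   # 4 pixels changed by 2
shifted = orig.copy(); shifted[0, :] = 2.0     # a different 4 pixels changed by 2
tinted = orig + 1.0                            # all 16 pixels changed by 1

def l2(a, b):
    return float(np.sqrt(((a - b) ** 2).sum()))

# identical L2 distance from the original, despite perceptually
# very different changes
d_boxed, d_shifted, d_tinted = l2(orig, boxed), l2(orig, shifted), l2(orig, tinted)
```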

00:40:40.642 --> 00:40:42.818
Another, sort of, problem
with the k-nearest neighbor

00:40:42.818 --> 00:40:45.338
classifier has to do with
something we call the curse

00:40:45.338 --> 00:40:46.819
of dimensionality.

00:40:46.819 --> 00:40:49.744
So, if you recall back this
viewpoint we had of the

00:40:49.744 --> 00:40:51.279
k-nearest neighbor classifier,

00:40:51.279 --> 00:40:53.811
it's sort of dropping paint
around each of the training

00:40:53.811 --> 00:40:57.186
data points and using that to
sort of partition the space.

00:40:57.186 --> 00:40:59.517
So that means that if we
expect the k-nearest neighbor

00:40:59.517 --> 00:41:01.105
classifier to work well,

00:41:01.105 --> 00:41:04.160
we kind of need our training
examples to cover the space

00:41:04.160 --> 00:41:05.646
quite densely.

00:41:05.646 --> 00:41:10.525
Otherwise our nearest neighbors
could actually be quite far

00:41:10.525 --> 00:41:14.171
away and might not actually
be very similar to our testing

00:41:14.171 --> 00:41:15.004
points.

00:41:16.738 --> 00:41:17.571
And the problem is,

00:41:17.571 --> 00:41:19.197
that actually densely covering the space,

00:41:19.197 --> 00:41:21.467
means that we need a number
of training examples,

00:41:21.467 --> 00:41:24.666
which is exponential in the
dimension of the problem.

00:41:24.666 --> 00:41:29.217
So this is very bad, exponential
growth is always bad,

00:41:29.217 --> 00:41:31.310
basically, you're never
going to get enough images

00:41:31.310 --> 00:41:33.422
to densely cover this space of pixels

00:41:33.422 --> 00:41:35.587
in this high dimensional space.

00:41:35.587 --> 00:41:37.487
So that's maybe another
thing to keep in mind

00:41:37.487 --> 00:41:41.300
when you're thinking about
using k-nearest neighbor.

00:41:41.300 --> 00:41:42.749
So, kind of the summary
is that we're using

00:41:42.749 --> 00:41:44.460
k-nearest neighbor to introduce this idea

00:41:44.460 --> 00:41:45.701
of image classification.

00:41:45.701 --> 00:41:47.634
We have a training set
of images and labels

00:41:47.634 --> 00:41:48.926
and then we use that

00:41:48.926 --> 00:41:51.445
to predict these labels on the test set.

00:41:51.445 --> 00:41:52.278
Question?

00:41:52.278 --> 00:41:54.519
[student asking a question]

00:41:54.519 --> 00:41:55.352
Oh, sorry, the question is,

00:41:55.352 --> 00:41:57.271
what was going on with this picture?

00:41:57.271 --> 00:41:59.168
What are the green and the blue dots?

00:41:59.168 --> 00:42:02.335
So here, we have some training samples

00:42:03.463 --> 00:42:05.596
which are represented by points,

00:42:05.596 --> 00:42:08.128
and the color of the dot
maybe represents the category

00:42:08.128 --> 00:42:11.004
of the point, of this training sample.

00:42:11.004 --> 00:42:12.629
So, if we're in one dimension,

00:42:12.629 --> 00:42:15.161
then you maybe only need
four training samples

00:42:15.161 --> 00:42:17.525
to densely cover the space,

00:42:17.525 --> 00:42:19.478
but if we move to two dimensions,

00:42:19.478 --> 00:42:22.701
then, we now need, four times
four is 16 training examples

00:42:22.701 --> 00:42:25.529
to densely cover this space.

00:42:25.529 --> 00:42:28.767
And if we move to three, four,
five, many more dimensions,

00:42:28.767 --> 00:42:30.431
the number of training
examples that we need

00:42:30.431 --> 00:42:32.344
to densely cover the space,

00:42:32.344 --> 00:42:35.200
grows exponentially with the dimension.
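The exponential count can be checked directly; the choice of four samples per axis is just the figure's convention.

```python
# If densely covering one axis takes about 4 samples, covering a
# d-dimensional space takes about 4**d samples: 4, 16, 64, ...
points_per_axis = 4
needed = [points_per_axis ** d for d in (1, 2, 3)]

# A CIFAR-10 image lives in 32*32*3 = 3072 dimensions, so dense
# coverage would take on the order of 4**3072 samples -- a number
# with well over a thousand digits.
cifar_dims = 32 * 32 * 3
digits = len(str(points_per_axis ** cifar_dims))
```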

00:42:35.200 --> 00:42:36.536
So, this is kind of giving you the sense,

00:42:36.536 --> 00:42:37.601
that maybe in two dimensions

00:42:37.601 --> 00:42:40.501
we might have this kind
of funny curved shape,

00:42:40.501 --> 00:42:44.848
or you might have sort of
arbitrary manifolds of labels

00:42:44.848 --> 00:42:47.641
in different dimensional spaces.

00:42:47.641 --> 00:42:49.403
Because the k-nearest neighbor algorithm

00:42:49.403 --> 00:42:51.704
doesn't really make any
assumptions about these

00:42:51.704 --> 00:42:53.029
underlying manifolds,

00:42:53.029 --> 00:42:54.565
the only way it can perform properly

00:42:54.565 --> 00:42:57.713
is if it has quite a dense
sample of training points

00:42:57.713 --> 00:42:58.796
to work with.

00:43:01.741 --> 00:43:04.915
So, this is kind of the
overview of k-nearest neighbors

00:43:04.915 --> 00:43:07.015
and you'll get a chance
to actually implement this

00:43:07.015 --> 00:43:11.214
and try it out on images
in the first assignment.

00:43:11.214 --> 00:43:12.953
So, if there are no last-minute
questions about kNN,

00:43:12.953 --> 00:43:16.059
I'm going to move on to the next topic.

00:43:16.059 --> 00:43:16.892
Question?

00:43:16.892 --> 00:43:21.014
[student is asking a question]

00:43:21.014 --> 00:43:22.034
Sorry, say that again.

00:43:22.034 --> 00:43:25.951
[student is asking a question]

00:43:28.437 --> 00:43:29.270
Yeah, so the question is,

00:43:29.270 --> 00:43:32.033
why do these images have
the same L2 distance?

00:43:32.033 --> 00:43:34.036
And the answer is that, I
carefully constructed them

00:43:34.036 --> 00:43:35.915
to have the same L2 distance.

00:43:35.915 --> 00:43:38.716
[laughing]

00:43:38.716 --> 00:43:41.812
But it's just giving you the
sense that the L2 distance

00:43:41.812 --> 00:43:45.096
is not a very good measure
of similarity between images.

00:43:45.096 --> 00:43:47.306
And these images are
actually all different from

00:43:47.306 --> 00:43:50.223
each other in quite disparate ways.

00:43:52.470 --> 00:43:54.498
If you're using kNN,

00:43:54.498 --> 00:43:56.654
then the only thing you
have to measure distance

00:43:56.654 --> 00:43:57.649
between images,

00:43:57.649 --> 00:44:00.236
is this single distance metric.

00:44:00.236 --> 00:44:03.200
And this kind of gives
you an example where

00:44:03.200 --> 00:44:05.229
that distance metric is
actually not capturing

00:44:05.229 --> 00:44:07.331
the full description of
distance or difference

00:44:07.331 --> 00:44:08.949
between images.

00:44:08.949 --> 00:44:12.042
So, if this case, I just sort
of carefully constructed these

00:44:12.042 --> 00:44:15.384
translations and these
offsets to match exactly.

00:44:15.384 --> 00:44:16.217
Question?

00:44:16.217 --> 00:44:19.884
[student asking a question]

00:44:28.672 --> 00:44:29.925
So, the question is,

00:44:29.925 --> 00:44:31.249
maybe this is actually good,

00:44:31.249 --> 00:44:33.228
because all of these things

00:44:33.228 --> 00:44:36.615
are actually having the
same distance to the image.

00:44:36.615 --> 00:44:38.024
That's maybe true for this example,

00:44:38.024 --> 00:44:40.390
but I think you could also
construct examples where

00:44:40.390 --> 00:44:41.810
maybe we have two original images

00:44:41.810 --> 00:44:43.352
and then by putting the
boxes in the right places

00:44:43.352 --> 00:44:44.270
or tinting them,

00:44:44.270 --> 00:44:46.652
we could cause it to be
nearer to pretty much

00:44:46.652 --> 00:44:48.316
anything that you want, right?

00:44:48.316 --> 00:44:50.850
Because in this example, we
can kind of like do arbitrary

00:44:50.850 --> 00:44:52.007
shifting and tinting

00:44:52.007 --> 00:44:54.612
to kind of change these
distances nearly arbitrarily

00:44:54.612 --> 00:44:57.469
without changing the perceptual
nature of these images.

00:44:57.469 --> 00:44:59.256
So, I think that this
can actually screw you up

00:44:59.256 --> 00:45:03.172
if you have many
different original images.

00:45:03.172 --> 00:45:04.039
Question?

00:45:04.039 --> 00:45:07.956
[student is asking a question]

00:45:15.207 --> 00:45:16.040
The question is,

00:45:16.040 --> 00:45:17.339
whether or not it's
common in real-world cases

00:45:17.339 --> 00:45:20.664
to go back and retrain the entire dataset

00:45:20.664 --> 00:45:24.098
once you've found those
best hyperparameters?

00:45:24.098 --> 00:45:27.765
So, people do sometimes
do this in practice,

00:45:28.787 --> 00:45:30.982
but it's somewhat a matter of taste.

00:45:30.982 --> 00:45:32.568
If you're really rushing for that deadline

00:45:32.568 --> 00:45:34.430
and you've really got to
get this model out the door

00:45:34.430 --> 00:45:36.781
then, if it takes a long
time to retrain the model

00:45:36.781 --> 00:45:38.176
on the whole dataset,

00:45:38.176 --> 00:45:39.875
then maybe you won't do it.

00:45:39.875 --> 00:45:41.776
But if you have a little
bit more time to spare

00:45:41.776 --> 00:45:43.432
and a little bit more compute to spare,

00:45:43.432 --> 00:45:45.929
and you want to squeeze out
maybe that extra 1%

00:45:45.929 --> 00:45:50.012
of performance, then that
is a trick you can use.

00:45:53.288 --> 00:45:54.858
So we kind of saw that
the k-nearest neighbor

00:45:54.858 --> 00:45:56.758
has a lot of the nice properties

00:45:56.758 --> 00:45:58.409
of machine learning algorithms,

00:45:58.409 --> 00:46:00.490
but in practice it's not so great,

00:46:00.490 --> 00:46:03.823
and really not used very much in images.

00:46:05.258 --> 00:46:07.134
So the next thing I'd
like to talk about is

00:46:07.134 --> 00:46:08.895
linear classification.

00:46:08.895 --> 00:46:11.753
And linear classification is,
again, quite a simple learning

00:46:11.753 --> 00:46:14.981
algorithm, but this will
become super important

00:46:14.981 --> 00:46:17.257
and help us build up to
whole neural networks

00:46:17.257 --> 00:46:19.845
and whole convolutional networks.

00:46:19.845 --> 00:46:21.736
So, one analogy people often talk about

00:46:21.736 --> 00:46:23.470
when working with neural networks

00:46:23.470 --> 00:46:26.242
is we think of them as being
kind of like Lego blocks.

00:46:26.242 --> 00:46:28.154
That you can have different
kinds of components

00:46:28.154 --> 00:46:30.345
of neural networks and you
can stick these components

00:46:30.345 --> 00:46:33.814
together to build these
large different towers of

00:46:33.814 --> 00:46:36.378
convolutional networks.

00:46:36.378 --> 00:46:38.404
One of the most basic
building blocks that we'll see

00:46:38.404 --> 00:46:41.513
in different types of
deep learning applications

00:46:41.513 --> 00:46:43.215
is this linear classifier.

00:46:43.215 --> 00:46:44.996
So, I think it's actually
really important to

00:46:44.996 --> 00:46:46.905
have a good understanding
of what's happening

00:46:46.905 --> 00:46:48.270
with linear classification.

00:46:48.270 --> 00:46:50.236
Because these will end up
generalizing quite nicely

00:46:50.236 --> 00:46:52.712
to whole neural networks.

00:46:52.712 --> 00:46:55.243
So another example of kind
of this modular nature

00:46:55.243 --> 00:46:56.139
of neural networks

00:46:56.139 --> 00:46:58.728
comes from some research in our
own lab on image captioning,

00:46:58.728 --> 00:47:00.281
just as a little bit of a preview.

00:47:00.281 --> 00:47:02.847
So here the setup is that
we want to input an image

00:47:02.847 --> 00:47:04.789
and then output a descriptive sentence

00:47:04.789 --> 00:47:06.226
describing the image.

00:47:06.226 --> 00:47:08.162
And the way this kind of works is that

00:47:08.162 --> 00:47:10.710
we have one convolutional
neural network that's looking

00:47:10.710 --> 00:47:11.548
at the image,

00:47:11.548 --> 00:47:13.444
and a recurrent neural network that knows

00:47:13.444 --> 00:47:14.496
about language.

00:47:14.496 --> 00:47:16.664
And we can kind of just stick
these two pieces together

00:47:16.664 --> 00:47:18.776
like Lego blocks and train
the whole thing together

00:47:18.776 --> 00:47:20.919
and end up with a pretty cool system

00:47:20.919 --> 00:47:22.767
that can do some non-trivial things.

00:47:22.767 --> 00:47:25.077
And we'll work through the
details of this model as we go

00:47:25.077 --> 00:47:26.388
forward in the class,

00:47:26.388 --> 00:47:27.711
but this just gives you the sense that,

00:47:27.711 --> 00:47:30.209
these deep neural networks
are kind of like Legos

00:47:30.209 --> 00:47:31.999
and this linear classifier

00:47:31.999 --> 00:47:34.082
is kind of like the most
basic building blocks

00:47:34.082 --> 00:47:36.082
of these giant networks.

00:47:37.096 --> 00:47:39.189
But that's a little bit too
exciting for lecture two,

00:47:39.189 --> 00:47:41.257
so we have to go back to
CIFAR-10 for the moment.

00:47:41.257 --> 00:47:42.375
[laughing]

00:47:42.375 --> 00:47:45.641
So, recall that CIFAR-10 has
these 50,000 training examples,

00:47:45.641 --> 00:47:49.808
each image is 32 by 32 pixels
and three color channels.

00:47:52.068 --> 00:47:53.963
In linear classification,
we're going to take a bit

00:47:53.963 --> 00:47:56.696
of a different approach
from k-nearest neighbor.

00:47:56.696 --> 00:48:01.646
So, the linear classifier is
one of the simplest examples

00:48:01.646 --> 00:48:04.734
of what we call a parametric model.

00:48:04.734 --> 00:48:07.548
So now, our parametric model
actually has two different

00:48:07.548 --> 00:48:08.685
components.

00:48:08.685 --> 00:48:12.201
It's going to take in this image,
maybe, of a cat on the left,

00:48:12.201 --> 00:48:13.539
and this,

00:48:13.539 --> 00:48:17.384
that we usually write
as X for our input data,

00:48:17.384 --> 00:48:20.220
and also a set of parameters, or weights,

00:48:20.220 --> 00:48:23.475
which is usually called
W, also sometimes theta,

00:48:23.475 --> 00:48:24.767
depending on the literature.

00:48:24.767 --> 00:48:26.974
And now we're going to
write down some function

00:48:26.974 --> 00:48:30.780
which takes in both the data,
X, and the parameters, W,

00:48:30.780 --> 00:48:34.180
and this'll spit out now
10 numbers describing

00:48:34.180 --> 00:48:37.950
what are the scores
corresponding to each of those 10

00:48:37.950 --> 00:48:39.991
categories in CIFAR-10.

00:48:39.991 --> 00:48:44.232
With the interpretation that,
like the larger score for cat,

00:48:44.232 --> 00:48:48.583
indicates a larger probability
of that input X being cat.

00:48:48.583 --> 00:48:49.827
And now, a question?

00:48:49.827 --> 00:48:53.494
[student asking a question]

00:48:55.380 --> 00:48:56.717
Sorry, can you repeat that?

00:48:56.717 --> 00:48:58.863
[student asking a question]

00:48:58.863 --> 00:49:01.524
Oh, so the question is what is the three?

00:49:01.524 --> 00:49:04.986
The three, in this example,
corresponds to the three color

00:49:04.986 --> 00:49:06.943
channels, red, green, and blue.

00:49:06.943 --> 00:49:08.469
Because we typically work on color images,

00:49:08.469 --> 00:49:12.636
that's nice information that
you don't want to throw away.

00:49:15.323 --> 00:49:17.243
So, in the k-nearest neighbor setup

00:49:17.243 --> 00:49:18.999
there was no parameters, instead,

00:49:18.999 --> 00:49:21.686
we just kind of keep around
the whole training data,

00:49:21.686 --> 00:49:22.657
the whole training set,

00:49:22.657 --> 00:49:24.092
and use that at test time.

00:49:24.092 --> 00:49:25.825
But now, in a parametric approach,

00:49:25.825 --> 00:49:28.196
we're going to summarize our
knowledge of the training data

00:49:28.196 --> 00:49:31.105
and stick all that knowledge
into these parameters, W.

00:49:31.105 --> 00:49:33.607
And now, at test time, we
no longer need the actual

00:49:33.607 --> 00:49:35.371
training data, we can throw it away.

00:49:35.371 --> 00:49:37.938
We only need these
parameters, W, at test time.

00:49:37.938 --> 00:49:40.561
So this allows our models
to now be more efficient

00:49:40.561 --> 00:49:44.684
and actually run on maybe
small devices like phones.

00:49:44.684 --> 00:49:47.277
So, kind of, the whole
story in deep learning

00:49:47.277 --> 00:49:49.404
is coming up with the
right structure for this

00:49:49.404 --> 00:49:50.906
function, F.

00:49:50.906 --> 00:49:53.451
You can imagine writing down
different functional forms

00:49:53.451 --> 00:49:55.406
for how to combine weights
and data in different

00:49:55.406 --> 00:49:58.608
complex ways, and these
could correspond to different

00:49:58.608 --> 00:50:01.169
network architectures.

00:50:01.169 --> 00:50:02.489
But the simplest possible example

00:50:02.489 --> 00:50:04.132
of combining these two things

00:50:04.132 --> 00:50:05.729
is just, maybe, to multiply them.

00:50:05.729 --> 00:50:08.833
And this is a linear classifier.

00:50:08.833 --> 00:50:13.181
So here our F of X, W is
just equal to W times X.

00:50:13.181 --> 00:50:15.770
Probably the simplest
equation you can imagine.

00:50:15.770 --> 00:50:16.603
So here,

00:50:16.603 --> 00:50:18.921
if you kind of unpack the
dimensions of these things,

00:50:18.921 --> 00:50:23.088
we recall that our image was
maybe 32 by 32 by 3 values.

00:50:24.871 --> 00:50:28.207
So then, we're going to take
those values and then stretch

00:50:28.207 --> 00:50:31.424
them out into a long column vector

00:50:31.424 --> 00:50:34.715
that has 3,072 entries.

00:50:34.715 --> 00:50:38.632
And now we want to end
up with 10 class scores.

00:50:39.746 --> 00:50:41.742
We want to end up with
10 numbers for this image

00:50:41.742 --> 00:50:44.236
giving us the scores for
each of the 10 categories.

00:50:44.236 --> 00:50:46.279
Which means that now our matrix, W,

00:50:46.279 --> 00:50:49.032
needs to be 10 by 3072.

00:50:49.032 --> 00:50:51.299
So that once we multiply
these two things out

00:50:51.299 --> 00:50:53.243
then we'll end up with
a single column vector

00:50:53.243 --> 00:50:57.086
10 by one, giving us our 10 class scores.

00:50:57.086 --> 00:51:00.317
Also sometimes, you'll typically see this,

00:51:00.317 --> 00:51:01.910
we'll often add a bias term

00:51:01.910 --> 00:51:04.724
which will be a constant
vector of 10 elements

00:51:04.724 --> 00:51:06.669
that does not interact
with the training data,

00:51:06.669 --> 00:51:09.656
and instead just gives us
some sort of data independent

00:51:09.656 --> 00:51:12.235
preferences for some classes over another.

00:51:12.235 --> 00:51:13.878
So you might imagine that
if you're dataset was

00:51:13.878 --> 00:51:16.300
unbalanced and had many
more cats than dogs,

00:51:16.300 --> 00:51:19.259
for example, then the bias
elements corresponding

00:51:19.259 --> 00:51:23.553
to cat would be higher
than the other ones.
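The parametric model described above, f(x, W) = Wx + b, is one line of NumPy once the image is stretched into a vector; the random weights and biases here are placeholders standing in for learned parameters.

```python
import numpy as np

num_classes, num_pixels = 10, 32 * 32 * 3   # CIFAR-10 shapes

rng = np.random.default_rng(0)
x = rng.random(num_pixels)                  # image stretched into a 3072-vector
W = rng.normal(size=(num_classes, num_pixels))  # 10 x 3072 weights (would be learned)
b = rng.normal(size=num_classes)            # 10 data-independent bias terms

scores = W @ x + b                          # f(x, W) = Wx + b -> 10 class scores
predicted_class = int(scores.argmax())      # highest score is the prediction
```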

00:51:23.553 --> 00:51:25.445
So if you kind of think about pictorially

00:51:25.445 --> 00:51:27.778
what this function is doing,

00:51:28.930 --> 00:51:31.483
in this figure we have
an example on the left

00:51:31.483 --> 00:51:35.729
of a simple image with
just a two by two image,

00:51:35.729 --> 00:51:37.099
so it has four pixels total.

00:51:37.099 --> 00:51:39.832
So the way that the
linear classifier works

00:51:39.832 --> 00:51:42.273
is that we take this two by two image,

00:51:42.273 --> 00:51:45.202
we stretch it out into a column vector

00:51:45.202 --> 00:51:46.577
with four elements,

00:51:46.577 --> 00:51:49.301
and now, in this example,
we are just restricting to

00:51:49.301 --> 00:51:51.765
three classes, cat, dog, and ship,

00:51:51.765 --> 00:51:54.030
because you can't fit 10 on a slide,

00:51:54.030 --> 00:51:58.197
and now our weight matrix is
going to be four by three,

00:51:59.890 --> 00:52:02.236
so we have four pixels and three classes.

00:52:02.236 --> 00:52:05.567
And now, again, we have a
three element bias vector

00:52:05.567 --> 00:52:08.858
that gives us data independent bias terms

00:52:08.858 --> 00:52:11.125
for each category.

00:52:11.125 --> 00:52:13.467
Now we see that the cat score
is going to be the inner

00:52:13.467 --> 00:52:16.717
product between the pixels of our image

00:52:18.436 --> 00:52:20.650
and this row in the weight matrix

00:52:20.650 --> 00:52:22.814
added together with this bias term.
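The two-by-two example can be written out numerically; these pixel values, weights, and biases are in the spirit of the slide's illustration, and each class score is a row-times-image inner product plus a bias.

```python
import numpy as np

# 2x2 image stretched into a 4-vector (illustrative pixel values)
x = np.array([56.0, 231.0, 24.0, 2.0])

# one row of weights per class -- cat, dog, ship -- plus a bias each
W = np.array([[ 0.2, -0.5,  0.1,  2.0 ],   # cat template
              [ 1.5,  1.3,  2.1,  0.0 ],   # dog template
              [ 0.0,  0.25, 0.2, -0.3 ]])  # ship template
b = np.array([1.1, 3.2, -1.2])

scores = W @ x + b            # all three class scores at once
cat_score = W[0] @ x + b[0]   # the same inner-product-plus-bias, for one class
```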

00:52:22.814 --> 00:52:25.134
So, when you look at it this way

00:52:25.134 --> 00:52:28.146
you can kind of understand
linear classification

00:52:28.146 --> 00:52:30.157
as almost a template matching approach.

00:52:30.157 --> 00:52:32.444
Where each of the rows in this matrix

00:52:32.444 --> 00:52:35.652
correspond to some template of the image.

00:52:35.652 --> 00:52:38.113
And now the inner product, or dot product,

00:52:38.113 --> 00:52:41.099
between the row of the
matrix and the column

00:52:41.099 --> 00:52:43.183
giving the pixels of the image,

00:52:43.183 --> 00:52:45.165
computing this dot
product kind of gives us

00:52:45.165 --> 00:52:47.874
a similarity between this
template for the class

00:52:47.874 --> 00:52:50.458
and the pixels of our image.

00:52:50.458 --> 00:52:52.906
And then bias just,
again, gives you this data

00:52:52.906 --> 00:52:57.073
independent scaling offset
to each of the classes.

00:53:00.837 --> 00:53:02.401
If we think about linear classification

00:53:02.401 --> 00:53:04.705
from this viewpoint of template matching

00:53:04.705 --> 00:53:07.532
we can actually take the
rows of that weight matrix

00:53:07.532 --> 00:53:09.965
and unravel them back into images

00:53:09.965 --> 00:53:12.799
and actually visualize
those templates as images.
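
Unraveling the rows of W back into images can be sketched like this; the weight matrix here is random, standing in for a trained one, and the CIFAR-10 image shapes are assumed:

```python
import numpy as np

num_classes, H, W_img, C = 10, 32, 32, 3  # CIFAR-10: 10 classes, 32x32x3 images
D = H * W_img * C                          # 3072 values per stretched-out image

# Stand-in for a learned weight matrix: one row (template) per class.
W = np.random.randn(num_classes, D)

# Reshape each row back into image form, then rescale to [0, 255] for viewing.
templates = W.reshape(num_classes, H, W_img, C)
lo, hi = templates.min(), templates.max()
template_images = (255 * (templates - lo) / (hi - lo)).astype(np.uint8)
print(template_images.shape)  # (10, 32, 32, 3) -- one template image per class
```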

00:53:12.799 --> 00:53:14.734
And this gives us some
sense of what a linear

00:53:14.734 --> 00:53:16.769
classifier might actually be doing

00:53:16.769 --> 00:53:18.908
to try to understand our data.

00:53:18.908 --> 00:53:21.057
So, in this example, we've
gone ahead and trained

00:53:21.057 --> 00:53:23.208
a linear classifier on our images.

00:53:23.208 --> 00:53:25.355
And now on the bottom we're visualizing

00:53:25.355 --> 00:53:28.974
what are those rows in
that learned weight matrix

00:53:28.974 --> 00:53:30.892
corresponding to each of the 10 categories

00:53:30.892 --> 00:53:32.274
in CIFAR-10.

00:53:32.274 --> 00:53:34.117
And in this way we kind
of get a sense for what's

00:53:34.117 --> 00:53:35.686
going on in these images.

00:53:35.686 --> 00:53:38.841
So, for example,
on the bottom left,

00:53:38.841 --> 00:53:40.978
we see the template for the plane class,

00:53:40.978 --> 00:53:42.941
kind of consists of this like blue blob,

00:53:42.941 --> 00:53:44.856
this kind of blobby thing in the middle

00:53:44.856 --> 00:53:46.739
and maybe blue in the background,

00:53:46.739 --> 00:53:49.212
which gives you the sense
that this linear classifier

00:53:49.212 --> 00:53:51.574
for plane is maybe looking for blue stuff

00:53:51.574 --> 00:53:54.585
and blobby stuff, and those
features are going to cause

00:53:54.585 --> 00:53:57.606
the classifier to like planes more.

00:53:57.606 --> 00:53:59.444
Or if we look at this car example,

00:53:59.444 --> 00:54:02.441
we kind of see that
there's a red blobby thing

00:54:02.441 --> 00:54:04.801
through the middle and a
blue blobby thing at the top

00:54:04.801 --> 00:54:08.721
that maybe is kind of a blurry windshield.

00:54:08.721 --> 00:54:09.654
But this is a little bit weird,

00:54:09.654 --> 00:54:11.271
this doesn't really look like a car.

00:54:11.271 --> 00:54:13.716
No individual car
actually looks like this.

00:54:13.716 --> 00:54:15.628
So the problem is that
the linear classifier

00:54:15.628 --> 00:54:18.317
is only learning one
template for each class.

00:54:18.317 --> 00:54:20.823
So if there's sort of
variations in how that class

00:54:20.823 --> 00:54:21.748
might appear,

00:54:21.748 --> 00:54:24.340
it's trying to average out all
those different variations,

00:54:24.340 --> 00:54:25.658
all those different appearances,

00:54:25.658 --> 00:54:27.461
and use just one single template

00:54:27.461 --> 00:54:29.675
to recognize each of those categories.

00:54:29.675 --> 00:54:31.961
We can also see this pretty
explicitly in the horse

00:54:31.961 --> 00:54:33.139
classifier.

00:54:33.139 --> 00:54:35.776
So in the horse classifier we
see green stuff on the bottom

00:54:35.776 --> 00:54:37.340
because horses are usually on grass.

00:54:37.340 --> 00:54:39.289
And then, if you look
carefully, the horse actually

00:54:39.289 --> 00:54:43.125
seems to have maybe two
heads, one head on each side.

00:54:43.125 --> 00:54:45.855
And I've never seen a
horse with two heads.

00:54:45.855 --> 00:54:48.063
But the linear classifier
is just doing the best

00:54:48.063 --> 00:54:50.513
that it can, because it's
only allowed to learn

00:54:50.513 --> 00:54:52.788
one template per category.

00:54:52.788 --> 00:54:55.085
And as we move forward
into neural networks

00:54:55.085 --> 00:54:56.367
and more complex models,

00:54:56.367 --> 00:54:59.460
we'll be able to achieve
much better accuracy

00:54:59.460 --> 00:55:01.230
because they no longer
have this restriction

00:55:01.230 --> 00:55:05.230
of just learning a single
template per category.

00:55:09.030 --> 00:55:11.223
Another viewpoint of the linear classifier

00:55:11.223 --> 00:55:13.523
is to go back to this idea of images

00:55:13.523 --> 00:55:15.649
as points in high-dimensional space.

00:55:15.649 --> 00:55:19.232
And you can imagine
that each of our images

00:55:20.191 --> 00:55:23.328
is something like a point in
this high dimensional space.

00:55:23.328 --> 00:55:26.341
And now the linear classifier
is putting in these

00:55:26.341 --> 00:55:29.305
linear decision boundaries
to try to draw linear

00:55:29.305 --> 00:55:31.402
separation between one category

00:55:31.402 --> 00:55:33.125
and the rest of the categories.

00:55:33.125 --> 00:55:35.838
So maybe up on the upper-left hand side

00:55:35.838 --> 00:55:38.897
we see these training
examples of airplanes

00:55:38.897 --> 00:55:41.446
and throughout the process of training

00:55:41.446 --> 00:55:43.845
the linear classifier will
go and try to draw this

00:55:43.845 --> 00:55:46.692
blue line to separate
out with a single line

00:55:46.692 --> 00:55:49.493
the airplane class from all
the rest of the classes.

00:55:49.493 --> 00:55:51.492
And it's actually kind of
fun if you watch during

00:55:51.492 --> 00:55:53.902
the training process these
lines will start out randomly

00:55:53.902 --> 00:55:55.818
and then go and snap into
place to try to separate

00:55:55.818 --> 00:55:57.318
the data properly.

00:55:58.709 --> 00:56:00.921
But when you think about
linear classification

00:56:00.921 --> 00:56:04.000
in this way, from this high
dimensional point of view,

00:56:04.000 --> 00:56:06.768
you can start to see again
what are some of the problems

00:56:06.768 --> 00:56:09.758
that might come up with
linear classification.

00:56:09.758 --> 00:56:11.672
And it's not too hard
to construct examples

00:56:11.672 --> 00:56:15.232
of datasets where a linear
classifier will totally fail.

00:56:15.232 --> 00:56:16.998
So, one example, on the left here,

00:56:16.998 --> 00:56:20.324
is that, suppose we have a
dataset of two categories,

00:56:20.324 --> 00:56:22.067
and these are all maybe
somewhat artificial,

00:56:22.067 --> 00:56:24.990
but maybe our dataset has two categories,

00:56:24.990 --> 00:56:26.095
blue and red.

00:56:26.095 --> 00:56:30.414
And the blue category
is anything where the number of pixels

00:56:30.414 --> 00:56:33.122
in the image that are
greater than zero is odd.

00:56:33.122 --> 00:56:34.944
And anything where the
number of pixels greater

00:56:34.944 --> 00:56:38.714
than zero is even, we want to
classify as the red category.

00:56:38.714 --> 00:56:42.881
So if you actually go and
draw what these different

00:56:43.991 --> 00:56:46.259
decision regions look like in the plane,

00:56:46.259 --> 00:56:49.907
you can see that our blue class
with an odd number of pixels

00:56:49.907 --> 00:56:53.426
is going to be these two
quadrants in the plane,

00:56:53.426 --> 00:56:56.063
and even will be the
opposite two quadrants.

00:56:56.063 --> 00:56:59.609
So now, there's no way that we
can draw a single straight line

00:56:59.609 --> 00:57:01.364
to separate the blue from the red.

00:57:01.364 --> 00:57:03.412
So this would be an example
where a linear classifier

00:57:03.412 --> 00:57:05.273
would really struggle.
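
A quick way to see this failure is a toy XOR-style parity problem: for the four points below, no single line classifies all of them correctly, so a simple perceptron (a linear classifier) can never reach full accuracy. This is an illustrative sketch, not from the lecture:

```python
import numpy as np

# Four 2-D points with parity-style labels: opposite quadrants share a
# label, exactly the XOR pattern from the slide.
X = np.array([[ 1.0,  1.0], [-1.0, -1.0], [ 1.0, -1.0], [-1.0,  1.0]])
y = np.array([-1, -1, 1, 1])

# Train a perceptron: w and b define a single linear decision boundary.
w, b = np.zeros(2), 0.0
for _ in range(100):
    for xi, yi in zip(X, y):
        if yi * (w.dot(xi) + b) <= 0:  # misclassified -> nudge the boundary
            w += yi * xi
            b += yi

preds = np.sign(X.dot(w) + b)
accuracy = np.mean(preds == y)
print(accuracy)  # stays below 1.0: no line separates the two quadrant pairs
```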

00:57:05.273 --> 00:57:09.535
And this is maybe not such an
artificial thing after all.

00:57:09.535 --> 00:57:10.912
Instead of counting pixels,

00:57:10.912 --> 00:57:13.042
maybe we're actually trying
to count whether the number

00:57:13.042 --> 00:57:16.424
of animals or people in
an image is odd or even.

00:57:16.424 --> 00:57:19.004
So this kind of a parity problem

00:57:19.004 --> 00:57:21.145
of separating odds from evens

00:57:21.145 --> 00:57:22.725
is something that linear classification

00:57:22.725 --> 00:57:25.725
really struggles with traditionally.

00:57:28.376 --> 00:57:31.438
Other situations where a linear
classifier really struggles

00:57:31.438 --> 00:57:33.974
are multimodal situations.

00:57:33.974 --> 00:57:35.146
So here on the right,

00:57:35.146 --> 00:57:39.541
maybe our blue category has
these three different islands

00:57:39.541 --> 00:57:41.915
of where the blue category lives,

00:57:41.915 --> 00:57:44.791
and then everything else
is some other category.

00:57:44.791 --> 00:57:46.529
So, for something like the horses

00:57:46.529 --> 00:57:47.961
we saw in the previous example,

00:57:47.961 --> 00:57:50.106
this is something that might
actually be happening

00:57:50.106 --> 00:57:50.959
in practice.

00:57:50.959 --> 00:57:53.560
Where there's maybe one
island in the pixel space of

00:57:53.560 --> 00:57:55.159
horses looking to the left,

00:57:55.159 --> 00:57:57.268
and another island of
horses looking to the right.

00:57:57.268 --> 00:58:00.360
And now there's no good
way to draw a single linear

00:58:00.360 --> 00:58:03.757
boundary between these two
isolated islands of data.

00:58:03.757 --> 00:58:07.257
So anytime where you have multimodal data,

00:58:08.098 --> 00:58:08.931
like one class

00:58:08.931 --> 00:58:10.854
that can appear in
different regions of space,

00:58:10.854 --> 00:58:13.963
is another place where linear
classifiers might struggle.

00:58:13.963 --> 00:58:15.932
So there's kind of a lot of problems with

00:58:15.932 --> 00:58:18.575
linear classifiers, but it
is a super simple algorithm,

00:58:18.575 --> 00:58:22.259
super nice and easy to interpret
and easy to understand.

00:58:22.259 --> 00:58:24.412
So you'll actually be
implementing these things

00:58:24.412 --> 00:58:27.245
on your first homework assignment.

00:58:28.852 --> 00:58:29.971
At this point,

00:58:29.971 --> 00:58:30.980
we kind of talked about

00:58:30.980 --> 00:58:33.313
what is the functional
form corresponding to a

00:58:33.313 --> 00:58:34.402
linear classifier.

00:58:34.402 --> 00:58:36.711
And we've seen that this functional form

00:58:36.711 --> 00:58:39.185
of matrix-vector multiplication

00:58:39.185 --> 00:58:41.541
corresponds to this idea of template matching

00:58:41.541 --> 00:58:43.364
and learning a single
template for each category

00:58:43.364 --> 00:58:44.922
in your data.

00:58:44.922 --> 00:58:48.339
And then once we have this trained matrix

00:58:49.485 --> 00:58:52.499
you can use it to actually
go and get your scores

00:58:52.499 --> 00:58:55.738
for any new input example.

00:58:55.738 --> 00:58:58.104
But what we have not told you is

00:58:58.104 --> 00:59:00.597
how do you actually go
about choosing the right W

00:59:00.597 --> 00:59:01.951
for your dataset.

00:59:01.951 --> 00:59:03.908
We've just talked about
what is the functional form

00:59:03.908 --> 00:59:06.907
and what is going on with this thing.

00:59:06.907 --> 00:59:11.086
So that's something we'll
really focus on next time.

00:59:11.086 --> 00:59:12.933
And next lecture we'll talk about

00:59:12.933 --> 00:59:14.668
what are the strategies and algorithms

00:59:14.668 --> 00:59:16.581
for choosing the right W.

00:59:16.581 --> 00:59:17.708
And this will lead us to questions

00:59:17.708 --> 00:59:19.544
of loss functions and optimization

00:59:19.544 --> 00:59:21.322
and eventually ConvNets.

00:59:21.322 --> 00:59:25.044
So, that's a bit of the
preview for next week.

00:59:25.044 --> 00:59:27.044
And that's all we have for today.